Android and Linux

Saturday, October 30, 2010

Speech recognition on Linux, through cheating

Speech recognition on Linux is hard to come by. I've been looking for a long time and never found anything that is very useful.

Most of the solutions available are nothing more than libraries for developing speech recognition programs, which require you to build your own language models. They're mostly meant to be used inside other software. Many of them look very well done, but I'm not a credit card company looking to set up a call center, so incorporating them is well above my ability. After playing around with a few of these, I decided they were a no-go.

I was hoping there would be a solution in the cloud, like an API to Google's online voice recognition engine, but Google doesn't have one and there don't appear to be any others available either.

So I decided to cheat. All I really need is to be able to say a few simple words and have it translated to a text file. I have voice recognition on my phone that does that, so why not use it?

To avoid repeating myself: this, like everything else on my blog lately, uses ssh with keys, Tasker with the Locale Execute Plugin, and the Google voice recognition API called from Python, all of which appear in my last few posts, plus a short script on the phone.

The short script we need is this, which I will call "vhome" for my examples:
#! /system/bin/sh
# Send the recognized text from the phone to the computer, then run the handler script there
cat /sdcard/.voice | ssh USER@IP -i /PATH/TO/SSH/KEY 'cat > /PATH/TO/A/FILE && /PC/COMMAND/SCRIPT'
All this script does is place the text from the /sdcard/.voice file into a file on your computer. From my previous posts, it should be clear how Tasker/Python/Google all work together to get your words into a file on your phone; this script simply gets that file to your computer.

The "&& ..." is optional and used for executing a script on the computer which will do something with the text. This example assumes you do want to use that because I didn't pre-filter it on the phone so the file on the computer will literally contain "computer [commands to be carried out]".

The script on the computer would get rid of the first word and parse the rest for keywords linked to commands to execute. It could be as simple as a single if/elif chain that looks for matches and runs commands. New commands can then be added easily by inserting another elif line with the keywords to match and the commands to run (there's a sketch of such a script after the example below).

Now it's as simple as setting up a Tasker task to trigger the script on a keyword, "computer" for example, and saying "computer [commands to be carried out]".

An example of using this would be saying "computer search more common hades." The script on the computer would get rid of "computer", use the if/elif chain to recognize "search" as a keyword, read the rest of the text into a variable and run this command:
firefox "http://www.google.com/search?q=$VAR"
The biggest downside is speed. It takes about 8-9 seconds from start of the voice API to command execution on the computer. One way to make this useful may be to set up a task in Tasker that keeps looping until you say a keyword to stop it. A quick example would be:

1- Write File .voice (this is used to zero out the file)
2- Run Script voice.py
3- Read Paragraph file: .voice to %VOICE
4- Goto Action 3 if %VOICE matches EOF
5- Stop If %VOICE matches halt
6- Locale Execute Plugin execute vhome If %VOICE matches computer
7- Goto Action 1

This task would show the voice recognition prompt. When you say "computer [commands to carry out]", it will execute vhome then show the voice recognition prompt again, unless you said "halt", in which case it dumps out of the loop and exits. Using this, you can place the phone on your desk and have voice commands at the ready.

So, that's about all. You now have speech recognition on Linux, thanks to cheating.

If you're also interested in speech synthesis on Linux, I suggest grabbing a good TTS engine like Festival and replacing the default diphone voice with a better one. Some of the most popular are the ARCTIC voices from Carnegie Mellon's Language Technologies Institute and the HTS versions of those voices from the HTS working group at Nagoya Institute of Technology. The voice I use is "cmu_us_slt_arctic_hts".

I haven't tried many other TTS engines because Festival seems to do everything I want, especially accepting text piped to it from a shell. That makes it pretty easy to set up a script that can be called to make the computer talk on certain events. You can add it to the examples above so your computer says "Yes sir, I'm opening the search results for More Common Hades."
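As a quick illustration, Festival's --tts mode reads text from standard input, so a line like this could be dropped into the "search" branch of the handler sketch above (the wording and the $ARGS variable are just examples from that sketch):
# in the "search" branch of the handler sketch above
echo "Yes sir, I'm opening the search results for $ARGS" | festival --tts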
