The Internet of Things is now ubiquitous, but even twenty-two years after its introduction, much of it, even the simplest parts, remains much less customizable than many of us would like. Luckily, most of us (yes, you!) can do something about that.
There are times when every original intention you had for a blog post winds up on the cutting room floor. This is one such post. It began simply enough - discuss the Internet of Things, create a project, share the project, eat some pie. But the further I got with my project, the deeper I dug, and the further down the rabbit hole I found myself.
The project idea is simple enough - make a device that listens for a keyword or wake word(s), and when it hears it, executes some simple code, sending a command to some other device. There are devices from major companies that do this, and millions of people have them, so why don’t I just buy one of those?
You’ve met me, right?
For most people, there are three ways to verbally access the Internet of Things. You either say “hey Google,” “hey Siri,” or “Alexa.” But what if you want to change your wake word? Well, if you have a Nest Hub, you could change your wake word to “OK Google.” On an Apple device you could switch from “hey Siri” to “OK Siri.” But what if you have an Amazon Echo? Well, the world’s your oyster, my friend! You have the freedom to change your wake word to any word you choose, just as long as the word you choose to change it to is either “Amazon,” “Echo” or “computer.”
Now admittedly that last one is great if you want that Star Trek feel around your home, but none of these devices come anywhere near what I would call customizable. I want a device that allows me to set any wake word I want, or even any command word I want. None of this is going to be possible on any of the aforementioned devices, but with a Raspberry Pi and some Python code, you control the wake words, the control words, and pretty much any verbal input you want. Want the name of your estate to be a keyword? No problem. What about your favorite toy from when you were a kid? Or that classmate you had a crush on in third grade? It’s all possible now.
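To make that idea a little more concrete, here is a minimal sketch of the "hear a phrase, run some code" part in Python. It is only an illustration and not code from my project: it doesn't touch a microphone or any real wake word engine, it assumes something upstream has already turned audio into text, and the phrases, the `send_command` helper and the `handle_transcript` function are all names I made up for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical record of commands "sent" to other devices. A real project
# would do something here like publish a message or toggle a GPIO pin.
sent_commands: List[str] = []


def send_command(command: str) -> None:
    """Stand-in for sending a command to some other device."""
    sent_commands.append(command)


# Any phrase you like can be a wake word or command word (made-up examples).
ACTIONS: Dict[str, Callable[[], None]] = {
    "pemberley": lambda: send_command("porch_light:on"),
    "teddy ruxpin": lambda: send_command("stereo:play"),
}


def handle_transcript(text: str) -> Optional[str]:
    """Run the action for the first known phrase found in the text.

    Returns the matched phrase, or None if nothing matched.
    """
    # Lowercase and collapse runs of whitespace so matching is forgiving.
    normalized = " ".join(text.lower().split())
    for phrase, action in ACTIONS.items():
        if phrase in normalized:
            action()
            return phrase
    return None
```

Calling `handle_transcript("Hey Pemberley, lights please")` would run the first action and append `"porch_light:on"` to `sent_commands`. A plain dictionary keeps the point visible: adding a new keyword is one more entry, not a firmware update from a vendor.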
Having worked with Pete Warden at the ARM AIoT Dev Summit, presenting a session on building a Harry Potter-esque wand using TensorFlow Lite (back when we were traveling and attending events), I knew that chapter 7 of Pete’s book TinyML focused entirely on wake word detection. Having worked with TensorFlow, I also knew that it took quite a bit of time to train a model, and that training works best when the audio samples come from multiple voices.
TensorFlow definitely yields solid results, but I wanted something simpler and less time consuming, so I started hunting around. I knew that I could probably use a speech-to-text service like Google Speech-to-Text, Microsoft Azure Speech, or IBM Watson Speech to Text, but I didn’t want to have to worry about a streaming API. Further down the rabbit hole.
While I know there are more options, I looked into three wake word engines - PocketSphinx, Carnegie Mellon University's open source, large vocabulary, speaker-independent continuous speech recognition engine; Kitt.ai’s hotword detector package Snowboy; and from Picovoice.ai, their wake-word engine Porcupine. All come with pros and cons, and here’s what I learned.
For custom word recognition, training your model(s) will be compulsory, and can be incredibly time consuming. In CMU’s Sphinx training tutorial, they require a full hour of audio data for command and control usage by a single speaker. If your application is going to require command and control usage by multiple people, you’ll need five hours of audio data provided by two hundred separate speakers.
For Snowboy’s Hotword Creator, you only need five hundred of your closest friends to record three samples each. If I can’t muster two hundred people to help me build a wake word for PocketSphinx, chances are slim to none that I’ll be able to come up with two and a half times that many to help me out with Snowboy.
That leaves Porcupine. PicoVoice’s mantra seems to be “No friends? No problem!” The PicoVoice console simply asks you to type in a wake word phrase, choose a language from the four they offer (English, French, German or Spanish), and hit the “Train Wake Word” button. In about three hours you’ll receive an email letting you know that you can download your data set. However, even before it’s completed, you can test it on the console. I was amazed at how well and how quickly my phrase was recognized.
CPU usage varied quite a bit between the three engines. PocketSphinx, arguably the most flexible of the three, was also the hungriest. Snowboy sat on the high side of center, and Porcupine came in at just 12 percent of what PocketSphinx used.
What good is speech recognition if your speech isn’t being recognized? False positives and negatives greatly affect performance, and when benchmarked, the results also swung fairly broadly.
Full disclosure, these findings were based on tests run by PicoVoice, so one might suspect that the results are skewed, like drug side-effect studies paid for by the manufacturer of the drug. However, they are completely transparent in their methodology, and I was able to find comparable results when the engines were tested by Rhasspy.
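If you'd like to sanity-check an engine yourself, the two numbers that matter are how often it misses the wake word and how often it fires when nobody said it. Below is a small sketch of that bookkeeping in Python. The `BenchmarkResult` class and its field names are my own invention for illustration, not anything taken from the PicoVoice or Rhasspy test code, and it assumes you have already counted detections by hand or with a script.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Hypothetical tally from one wake word test run."""

    wake_word_utterances: int  # times the wake word was actually spoken
    detected: int              # how many of those the engine caught
    false_alarms: int          # detections when no wake word was spoken
    background_hours: float    # hours of audio containing no wake word

    @property
    def miss_rate(self) -> float:
        """Fraction of real wake words the engine failed to catch."""
        return 1.0 - self.detected / self.wake_word_utterances

    @property
    def false_alarms_per_hour(self) -> float:
        """Spurious detections per hour of background audio."""
        return self.false_alarms / self.background_hours
```

With made-up counts such as `BenchmarkResult(200, 190, 3, 24.0)`, `miss_rate` is the share of the 200 utterances that went undetected and `false_alarms_per_hour` is 3 spurious triggers spread over 24 hours. The two have to be read together, since an engine can look great on one simply by being terrible at the other.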
Having been created in an educational setting, PocketSphinx is more than just a speech-to-text engine; it is a research system, so it will most likely continue to improve and expand. For an educational entity, it definitely should not be overlooked.
As far as Snowboy is concerned, it is a decent engine. However, KITT.AI announced early last year that they would be shutting down all of their products as of December 31st, 2020. It remains in their GitHub repository, and can still be used and improved, but any support is now purely community-based.
With its low CPU usage and extremely high accuracy, PicoVoice’s Porcupine is the engine I chose to use for my project. That being said, there are still some things to be aware of if you decide to use Porcupine for any IoT projects. First, you’ll need to create an account with PicoVoice. I know that there can be some aversion to that, but isn’t that what your Yahoo account is for?
Then there’s the decision that comes with your account. PicoVoice offers two options - you can create a Personal Account or an Enterprise Account. Speech models created with a Personal Account cannot be used for any commercial applications. This account type is designed for researchers, hobbyists, tinkerers, and educators. Additionally, wake words created with a Personal Account expire after thirty days. By contrast, an Enterprise Account allows for commercial use and distribution, and the wake words do not expire. If you’re a small startup or single-employee company, the $400/month (charged annually) might be a bit steep, but it is worth noting that they do offer a “Startup Discount”.
After all of this digging and researching, I have to say I’m very happy with Porcupine as a wake word engine for my project. Oh right, my project… My IoT project blog post will have to wait for another day, but do stay tuned, as this potential money-making project, which is definitely not a cryptocurrency harvester, promises to be much more fun than training a custom wake word for five hours! And if a bit more research on the front end can yield a more efficient project on the back end, then I believe that it's time well spent.
I need voice recognition on a Pi-based system and have been fooling with the Vosk software. So far, I'm cheating: the Vosk server is running on my desktop Linux box and the Pi just pumps audio samples to it and parses the returned text strings. The small Vosk models (several languages available) are supposed to be capable of running on a Pi.
In a controlled environment it's not too bad, but as I have it configured, it's not really good enough to trust with wake words. It's a large vocabulary recognizer. One of the things on my to-do list is to look into giving it a much more limited dictionary. This is apparently straightforward -- not the same as simple -- and is supposed to improve accuracy.
The standard computer science joke from decades ago was "I may be artificially intelligent, but I refuse to wreck a nice peach!"
I'm still happy being able to reach the X-10 controller from my bed to turn the lights on/off, and I'm certain that it's not going to be held for ransom, even if the culprits sharpened their skills on a certain petroleum pipeline...