How to build your own Alexa-like personal assistant

Voice and natural language serve up the UI of the future. Here's how to incorporate them into your applications, without relying on someone else's API

Voice and natural language systems are an important step toward making our digital servants serve us on our terms. We went from punch cards to green screens to GUIs and eventually to touch-based, palm-sized, location- and context-sensitive computers in the form of smartphones (not to mention those annoying smart-car panels). Now we have Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, and Google’s Assistant answering our needs.

To build voice and natural language capabilities into your own applications, you have several cloud options. For Alexa, you can tap into an open API at no apparent cost beyond AWS charges; the same goes for Google, although the Google Cloud site is as clear as mud on this point. Microsoft even lets you reuse your Alexa skills package with Cortana. For Apple, there’s an API, along with the $99 cost of becoming an Apple Developer and publishing an iOS app.

But why lock yourself into Amazon’s or Apple’s or anyone else’s platform to get these capabilities? Anybody can build their own system to voice-enable their devices, websites, or gadgets today. It’s a matter of speech to text, a query parser, a pipeline, a rules engine, and a pluggable architecture with open APIs. (Full disclosure: I work for Lucidworks, a search technology company with a product that covers most of these tasks.)

[Figure: Block diagram of an intelligent personal assistant. Caption: How an intelligent personal assistant works. Credit: Andrew C. Oliver]

Speech to text

I remember when I first saw IBM’s Windows 95 voice-enabled Aptiva desktop computer that let you control your computer with voice commands. The voice interface was a bit clunky because Windows 95 wasn’t really designed with voice commands in mind—but it made a hell of a demo!

These days you have your pick of speech recognition libraries or cloud solutions. You can (and have been able to for a while) embed them into anything. Some packages are even accurate.

Text to speech

Speech synthesis has existed since we first had sound cards. Heck, I vaguely remember DOS libraries that could do horrible things with the onboard speaker that claimed to be speech. Most modern operating systems from Android to Windows to OS X have built-in APIs for speech synthesis.

Query parser

Once speech has become text, most of the real work is done by the query parser. It reduces words to their roots (“stemming”) and groups adjacent words into phrases. Query parsers (such as Extended DisMax) have come a long way from even a few years ago.
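To make those two jobs concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the suffix rules in `stem`, the `KNOWN_PHRASES` set, and the `parse` function itself. A real parser such as Extended DisMax leans on a full analyzer chain, not on anything this naive.

```python
import re

# Hypothetical phrase dictionary; a real system would configure or learn these.
KNOWN_PHRASES = {("speech", "recognition"), ("personal", "assistant")}


def stem(word: str) -> str:
    """Toy suffix stripper (an assumption, not a real stemming algorithm)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def parse(query: str) -> list[str]:
    """Lowercase, tokenize, keep known two-word phrases whole, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i : i + 2])
        if pair in KNOWN_PHRASES:
            terms.append(" ".join(pair))  # treat the pair as a single phrase
            i += 2
        else:
            terms.append(stem(tokens[i]))
            i += 1
    return terms


print(parse("Speech recognition libraries in C"))
```

The point of the sketch is only the shape of the output: a list of normalized terms and phrases that the rest of the system can match against an index.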

In the old days, asking even Google a question meant either doing a pure keyword/term search or learning a somewhat byzantine syntax and composing queries like (+“this phrase in the document” AND -“this phrase in the document”) OR (“something that may be in the document” AND -“this shouldn’t be there”). Now you search for stuff in something as close to “plain English” as possible.

To a large degree, the new query parsers moved the smarts out of the developer’s UI and into the search engine itself.

Pipeline

A lot of tasks may need to be performed on a query before we either pass it to custom plugged-in commands (“skills”) or execute a search against our index. Moreover, special results (such as “restaurants in my area”) need different processing than run-of-the-mill search results before we return them to the user. To do this appropriately, you need some kind of pipeline for queries coming in and results coming out.
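One way to picture such a pipeline is as a list of small functions, each taking a dictionary and returning a modified copy. The sketch below assumes that shape; the stage names (`normalize`, `detect_local_intent`, `sort_by_distance`, `limit_hits`) and the `miles` field are hypothetical and stand in for whatever processing your application needs.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def run_pipeline(stages: list[Stage], payload: dict) -> dict:
    """Pass the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload


# --- Query stages (hypothetical) ---
def normalize(query: dict) -> dict:
    # Lowercase and collapse whitespace.
    return {**query, "text": " ".join(query["text"].lower().split())}


def detect_local_intent(query: dict) -> dict:
    # Flag queries that need location-aware handling.
    return {**query, "local": "in my area" in query["text"]}


# --- Result stages (hypothetical) ---
def sort_by_distance(results: dict) -> dict:
    # Only local results get re-sorted; everything else passes through.
    if not results["local"]:
        return results
    return {**results, "hits": sorted(results["hits"], key=lambda h: h["miles"])}


def limit_hits(results: dict) -> dict:
    return {**results, "hits": results["hits"][:2]}


QUERY_STAGES = [normalize, detect_local_intent]
RESULT_STAGES = [sort_by_distance, limit_hits]

query = run_pipeline(QUERY_STAGES, {"text": "  Restaurants in MY area "})
raw = {
    "local": query["local"],
    "hits": [
        {"name": "A", "miles": 5.0},
        {"name": "B", "miles": 0.4},
        {"name": "C", "miles": 2.1},
    ],
}
results = run_pipeline(RESULT_STAGES, raw)
print([hit["name"] for hit in results["hits"]])
```

Because each stage is independent, adding a new step is a matter of writing one function and appending it to a list.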

Rules and/or domain-specific language

Some items really are a series of if-then-else statements. When someone asks for the “about page,” send them to /about.html. When a query contains “weather,” call the weather service.
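Those two rules could be written as a small table of condition/action pairs, as in the sketch below. The `Rule` type, the shape of the returned dictionaries, and `fetch_weather` (a stand-in for a real weather service call) are all assumptions for illustration.

```python
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    matches: Callable[[str], bool]  # condition on the lowercased query
    action: Callable[[str], dict]   # what to do when the condition holds


def fetch_weather(query: str) -> dict:
    # Hypothetical stand-in: a real system would call a weather service here.
    return {"type": "service", "name": "weather", "query": query}


RULES = [
    Rule(lambda q: "about page" in q,
         lambda q: {"type": "redirect", "url": "/about.html"}),
    Rule(lambda q: "weather" in q, fetch_weather),
]


def apply_rules(query: str) -> Optional[dict]:
    """Return the action of the first matching rule, or None if no rule fires."""
    text = query.lower()
    for rule in RULES:
        if rule.matches(text):
            return rule.action(text)
    return None  # fall through to an ordinary search


print(apply_rules("Show me the About page"))
```

Returning `None` when nothing matches lets the caller fall back to a normal index search, which is usually what you want for everything the rules do not cover.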

Other details are a sort of “domain” or a combination of a rule and a domain, such as “recipes for tarts containing cherries” or “speech recognition libraries in C.” For these, you might map them as searches where title=“* tart *”, document-type=recipe, ingredients=cherry.
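As a rough sketch of that mapping, the snippet below recognizes one sentence shape ("X for Y containing Z") and turns it into the fielded search described above. The pattern, the `singular` helper, and `to_structured_query` are assumptions made for this example; a real domain-specific language would cover far more shapes.

```python
import re
from typing import Optional

# One hypothetical sentence shape: "<type> for <title word> containing <ingredient>".
PATTERN = re.compile(r"^(\w+) for (\w+) containing (\w+)$")


def singular(word: str) -> str:
    """Naive singularizer (an assumption; good enough for the example only)."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def to_structured_query(text: str) -> Optional[dict]:
    """Map a matching phrase to field filters, or return None if it does not fit."""
    match = PATTERN.match(text.lower().strip())
    if match is None:
        return None
    doc_type, title, ingredient = (singular(group) for group in match.groups())
    return {
        "document-type": doc_type,
        "title": f"* {title} *",
        "ingredients": ingredient,
    }


print(to_structured_query("Recipes for tarts containing cherries"))
```

Anything that does not fit the pattern returns `None`, so it can be handed to a rule or to plain search instead.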

Tagging/natural language processing

For truly flexible search, you need software that can map unstructured data into reasonable, searchable structures. This means that when a linked document is parsed and indexed, the system should recognize that the “entity” mentioned is Google or Alphabet and that the document type is an SEC filing of the subtype 10-K. This requires recognizing these details and “tagging” them.
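A very simple form of index-time tagging is a dictionary lookup plus a pattern or two, sketched below. The `KNOWN_ENTITIES` dictionary, the `FORM_PATTERN` regular expression, and the tag names are assumptions for illustration; production systems typically use trained entity recognizers instead of a hand-built list.

```python
import re

# Hypothetical gazetteer mapping lowercase tokens to entity names.
KNOWN_ENTITIES = {"google": "Google", "alphabet": "Alphabet"}

# Hypothetical pattern for spotting the form type in the text.
FORM_PATTERN = re.compile(r"\bform\s+(10-K|10-Q|8-K)\b", re.IGNORECASE)


def tag_document(text: str) -> dict:
    """Return searchable tags extracted from unstructured text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    entities = sorted(name for token, name in KNOWN_ENTITIES.items() if token in words)
    tags = {"entities": entities}
    form = FORM_PATTERN.search(text)
    if form:
        tags["document-type"] = "SEC filing"
        tags["subtype"] = form.group(1).upper()
    return tags


print(tag_document("Alphabet Inc. Form 10-K. Google segment revenues are discussed."))
```

The tags would be stored alongside the document in the index so that later queries can filter on them.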

For a human-friendly search, the system needs to recognize parts of speech. “10-K reports about Google” and “10-K reports mentioning Google” are two different matters. This requires parts-of-speech tagging, potentially at index time, but may also require natural language processing at query time.
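To show why the distinction matters at query time, here is a deliberately crude sketch. It swaps in a keyword lookup for real part-of-speech tagging: "about" is routed to a field holding the document's main entity, while "mentioning" is routed to the full text. The field names `primary_entity` and `body_text`, the pattern, and `interpret` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical index fields: who the filing is about vs. everything it says.
RELATION_FIELDS = {"about": "primary_entity", "mentioning": "body_text"}

QUERY_PATTERN = re.compile(
    r"^(?P<subtype>\S+) reports (?P<relation>about|mentioning) (?P<entity>.+)$",
    re.IGNORECASE,
)


def interpret(query: str) -> Optional[dict]:
    """Choose the search field based on the relation word in the query."""
    match = QUERY_PATTERN.match(query.strip())
    if match is None:
        return None
    field = RELATION_FIELDS[match["relation"].lower()]
    return {"subtype": match["subtype"].upper(), field: match["entity"]}


print(interpret("10-K reports about Google"))
print(interpret("10-K reports mentioning Google"))
```

The two queries differ by one word, yet they end up searching different fields. A real system would reach that decision through proper language analysis, not a two-entry lookup table.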

Pluggable architecture

In general, “build a modular architecture” is another way of saying “don’t make software that sucks.” Change is the only absolute constant. All major vendors have a way to plug new functions into their Alexa-like creation.

With modules you usually get some way of “discovering” the new functionality. This is nothing new. It means having a decent API with a descriptor or metadata explaining how to plug the functionality in and what it does.
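Here is a minimal sketch of that idea: each skill registers itself with a descriptor, and the assistant can list the descriptors or dispatch a query to the first skill whose trigger words match. The `skill` decorator, the `REGISTRY` dictionary, and the descriptor fields are assumptions for this example, not any vendor's actual plug-in format.

```python
from typing import Callable, Optional

# Hypothetical in-memory registry: skill name -> descriptor plus handler.
REGISTRY: dict[str, dict] = {}


def skill(name: str, description: str, triggers: list[str]) -> Callable:
    """Decorator that registers a handler along with its metadata."""
    def decorator(handler: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = {
            "name": name,
            "description": description,
            "triggers": triggers,
            "handler": handler,
        }
        return handler
    return decorator


@skill(name="weather", description="Reports the weather for a place.",
       triggers=["weather", "forecast"])
def weather_skill(query: str) -> str:
    return f"Looking up the weather for: {query}"


def describe_skills() -> list[dict]:
    """Discovery: return the metadata for each skill, without the handler."""
    return [{k: v for k, v in entry.items() if k != "handler"}
            for entry in REGISTRY.values()]


def dispatch(query: str) -> Optional[str]:
    """Run the first skill whose trigger word appears in the query."""
    text = query.lower()
    for entry in REGISTRY.values():
        if any(trigger in text for trigger in entry["triggers"]):
            return entry["handler"](query)
    return None


print(describe_skills())
print(dispatch("Weather in Raleigh"))
```

Adding a new capability then means writing one decorated function; nothing in the dispatcher has to change.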

Open APIs

If you’re coding in today’s world, an “open API” should mean a REST API. You should be prepared to receive JSON over HTTPS and emit the same. You don’t know what new stuff the future holds, so build for resiliency.
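As a sketch of what "build for resiliency" can mean in practice, the handler below accepts a JSON body, validates only the one field it needs, ignores fields it does not recognize, and always answers in JSON. The function name `handle_request`, the `query` and `locale` fields, and the response shape are assumptions; HTTPS termination is left to whatever web server sits in front of it.

```python
import json


def handle_request(body: bytes) -> tuple[int, bytes]:
    """Take a JSON request body and return (HTTP status, JSON response body)."""
    try:
        request = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "body must be valid JSON"}).encode()

    if not isinstance(request, dict) or not isinstance(request.get("query"), str):
        return 400, json.dumps({"error": "'query' must be a string"}).encode()

    # Unknown fields are ignored, not rejected, so newer clients keep working.
    response = {
        "query": request["query"],
        "locale": request.get("locale", "en-US"),  # hypothetical optional field
        "results": [],  # a real handler would fill this from skills or search
    }
    return 200, json.dumps(response).encode()


status, payload = handle_request(b'{"query": "weather", "future_field": 1}')
print(status, payload.decode())
```

Tolerating extra fields on input and keeping the output shape stable is one modest way to leave room for whatever clients show up later.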

Why roll your own?

Maybe your homegrown Alexa is all behind your firewall. Maybe it’s a limited-function, site-specific system to provide a new way for e-commerce customers to find what they need on your site or at an in-store kiosk. Maybe your assistant is more of a shop floor device to locate manufacturing equipment. One can always name a reason to roll one’s own.

Whether you’re doing it yourself or plugging into the new world of cloud-based personal assistants, you have decades of libraries and expertise to build on. Alexa and her ilk were inevitable—and building your own version is well within reach.