07 Feb 2018 Tips and tricks for building an Alexa skill
The skill will allow Energy Australia customers to ask Alexa for information regarding their bills, and to get tips on how to minimise their energy usage. In this blog post I’ll give an overview of our solution, and outline some of the tips and pitfalls we discovered during development.
Before we get into the tips and tricks, I’ll first give you an overview of our solution.
For those new to Alexa skill development, every Alexa skill consists of two main components: a model and a backend service. The model is uploaded to Alexa through the developer console, and describes the intents a user can express, along with sample utterances for each. It also allows you to define ‘slots’ for custom data that the user provides with the intent.
Once Alexa identifies a user’s intent and slot information, it calls the backend service. The service performs the necessary interactions with external APIs, then generates a response to be spoken back to the user.
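To make the request/response loop concrete, here is a minimal sketch of the JSON body a backend service returns for Alexa to speak. The field names follow the Alexa Skills Kit response format; the bill amount is made up for illustration.

```python
import json


def build_alexa_response(speech_text, end_session=True):
    """Build the minimal JSON body Alexa expects back from a skill service."""
    return {
        "version": "1.0",
        "response": {
            # PlainText is the simplest output type; SSML is also supported.
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }


body = build_alexa_response("Your last electricity bill was 142 dollars.")
print(json.dumps(body, indent=2))
```

Setting `shouldEndSession` to `False` keeps the session open so the skill can ask a follow-up question, which matters later when filling slots manually.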
Users are also able to link their Alexa account to other third-party accounts using OAuth. Setting this up was pretty straightforward, especially if you’ve set up OAuth in the past or your application supports it already.
We used the model builder provided by Amazon through the Alexa developer console to define our basic intents. In our case our intents were:
- Retrieve bill information,
- Get energy tips, and
- Get contact information
Defining the slots needed by the bill information intent was also relatively easy. A custom slot type was created to represent the fuel type (either electricity or gas), along with a number of variations of both. In one of the early versions of the model we also defined a number of sample phrases Alexa could use to ask for the fuel type if the user hadn’t explicitly stated which fuel they were interested in.
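A fragment of such a model might look like the following, expressed here as a Python dict (the developer console accepts the equivalent JSON). The intent, slot, and synonym names below are illustrative assumptions, not the skill’s actual identifiers.

```python
# Custom slot type for the fuel, with synonyms mapping variations back to
# a canonical id -- names here are illustrative, not the real model.
FUEL_SLOT_TYPE = {
    "name": "FuelType",
    "values": [
        {"id": "electricity",
         "name": {"value": "electricity",
                  "synonyms": ["power", "electric", "the lights"]}},
        {"id": "gas",
         "name": {"value": "gas",
                  "synonyms": ["natural gas", "the gas"]}},
    ],
}

# An intent referencing the slot; {fuel} marks where the slot value appears
# in each sample utterance.
BILL_INTENT = {
    "name": "GetBillIntent",
    "slots": [{"name": "fuel", "type": "FuelType"}],
    "samples": [
        "what's my {fuel} bill",
        "how much is my {fuel} bill",
        "tell me about my {fuel} usage",
    ],
}
```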
The backend service
The backend service for our skill was written in Python using the flask-ask library and run as an AWS Lambda function, the deployment of which was managed using the Zappa serverless framework. This set of tools was chosen as they allowed us to quickly develop and deploy the skill in our existing AWS environment without the need to maintain EC2 instances or container clusters.
flask-ask did most of the heavy lifting in terms of providing convenient annotations for defining the handlers for each intent. The built-in annotations were augmented to include parsing of the JWT provided by Alexa, which contained the user data.
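The JWT-parsing step can be sketched with just the standard library. Note this is a simplified illustration: it decodes only the payload segment and skips signature verification, which a real service must perform first, and the claim names are assumptions rather than the actual token contents.

```python
import base64
import json


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT into a dict.

    WARNING: this does NOT verify the signature -- production code must
    validate the token before trusting any claims in it.
    """
    payload_b64 = token.split(".")[1]
    # Restore the base64 padding that the JWT encoding strips.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

With the claims decoded, an intent handler can look up the linked account before calling any downstream APIs.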
Tips and Tricks
Now that you’re familiar with the structure of our solution and the high-level technical choices we made, let’s get into some of the issues that we faced when putting it all together.
Accents
The first hurdle we ran into concerned accents. Specifically, whilst the voice model was targeting the Australian accent, my accent is from the UK. This meant that Alexa would sometimes misinterpret the things I said when attempting to fill a slot. The most common error was “gas” being interpreted as “yes”.
UK accents are a little different from Australian accents, but not that different. To work around the issue, we added synonyms for the problematic slot values, which saved the user from having to keep repeating themselves. Similarly, for the intents, we adjusted the sample utterances to make sure they were distinct and had little overlap.
Finally, we worked with some consultants from Amazon, who did some specific behind-the-scenes training of the voice model to improve the recognition of words that were important to our skill. This was particularly useful as the Australian release of the Echo was imminent, and it was important that Alexa be able to decipher Australian accents.
Confusing Custom Slot Prompts
When developing an Alexa skill, you can configure it to use custom user prompts to fill particular slots. However, it proved to be a little tricky to re-prompt the user in a natural way without breaking the flow of the conversation.
In the end we dispensed with delegating slot filling to Alexa, and instead handled it manually in our Lambda function. We also made some modifications in the backend service to treat certain common misinterpretations as synonymous, which avoided having to continually re-prompt the user.
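Treating common misinterpretations as synonymous can be as simple as a lookup table in the backend. The alias pairs below are hypothetical (apart from the “yes”-for-“gas” mishearing mentioned above); the real mapping was tuned from testing.

```python
# Hypothetical mapping from raw slot values (including common mishearings)
# to canonical fuel types -- the real pairs were discovered during testing.
FUEL_ALIASES = {
    "gas": "gas",
    "yes": "gas",          # frequent mishearing of "gas"
    "electricity": "electricity",
    "electric": "electricity",
    "power": "electricity",
}


def canonical_fuel(raw_value):
    """Return the canonical fuel type for a raw slot value, or None if it
    can't be resolved (in which case the skill re-prompts the user)."""
    if raw_value is None:
        return None
    return FUEL_ALIASES.get(raw_value.strip().lower())
```

Returning `None` for unrecognised values gives the handler a single place to decide when a re-prompt is genuinely needed.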
Handling Unexpected Input
When the user says something unexpected that veers from the course of the expected interaction model, the default response can be disorienting. For instance, with our basic model, if the user opened the Energy Australia skill and asked for ‘bananas’ (yes, that actually happened during testing), Alexa would simply beep and close the skill without so much as a word. This was not a good experience for the user.
To deal with this scenario, we developed a ‘catch-all’ intent: an intent whose sample utterances were both far from the things we expected the user to say and as all-encompassing as possible. To build the large number of utterances needed for the catch-all intent, we used a large number of short phrases, including memes, song lyrics and popular catchphrases.
Overlapping Sample Utterances
We also ran into a couple of issues when specifying the sample utterances for each of the intents. You are encouraged to add as many sample utterances as possible to ensure the best possible match rate, and there are several good tools available online for generating these.
However, it is important to make sure that there is little to no overlap between the utterances used for intents and for slot filling. For example, the utterance “about my electricity” can be used to establish an intent, but “my electricity” could also be used for slot filling. So even though we knew they were contextually separate, Alexa would occasionally match the slot-filling utterance at the wrong point in the conversation, resulting in an error response.
We resolved this by reducing the number of utterances for slot filling, and ensuring the sample utterances were as distinct as possible.
Latency
One of the biggest concerns when developing a voice interface is latency. In our case, we were primarily concerned with how long it took to return a response to Alexa. The total timeout is around 5 seconds, after which Alexa gives up and informs the user that the skill took too long.
Whilst a five-second delay can be a little annoying to the end user of a traditional visual interface like a website, it’s not a deal-breaker. With a voice interface, a five-second delay can feel much worse, as people aren’t used to enduring five seconds of deafening silence when they ask a real person a question. At the very least, a real person might utter ‘umm’ or provide some visual cue to indicate that they are thinking. Whilst the light at the top of an Echo will flash as it processes input to indicate that something is happening, it’s still preferable to not make the user wait too long.
Unfortunately, in an enterprise system designed for traditional visual interfaces, five second delays are not uncommon. We ran into this when connecting to our existing on-premise CRM system, which only had a few SOAP web services available to obtain user data. This was compounded by the fact that the data we needed was split across a number of very expensive web service calls.
We partially mitigated the problem using the usual tricks employed when developing backend services: reusing sockets and configuring a CloudWatch event to keep the Lambda warm, avoiding any potential startup time. However, the real time-saver was modifying the backend to make as many SOAP calls in parallel as possible. This brought the skill response time back to a more acceptable level.
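The parallelisation can be sketched with `concurrent.futures` from the standard library. The three fetch functions below are stand-ins for the expensive CRM SOAP calls; their names, return values, and timings are all illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the slow CRM SOAP calls -- names and data are made up.
def fetch_account(user_id):
    time.sleep(0.2)
    return {"account": user_id}

def fetch_latest_bill(user_id):
    time.sleep(0.2)
    return {"bill": 142.0}

def fetch_usage(user_id):
    time.sleep(0.2)
    return {"usage_kwh": 310}


def fetch_all(user_id):
    """Issue the independent SOAP calls concurrently, so the total latency
    is roughly that of the slowest call rather than the sum of all three."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, user_id)
                   for f in (fetch_account, fetch_latest_bill, fetch_usage)]
        results = {}
        for future in futures:
            results.update(future.result())
    return results
```

Threads suit this workload because the Lambda spends almost all of its time blocked on network I/O, so the GIL isn’t a bottleneck.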
There are certainly a number of pitfalls when developing an Alexa skill, such as the confusion between slot-filling utterances and intents, and the misinterpretation of certain accents. However, these will likely be ironed out as the platform matures.
Voice interfaces provide a novel and natural method for users to interact with their account, and we look forward to extending this Alexa skill to provide EnergyAustralia’s customers with even more helpful and detailed insights into their energy accounts.