As conversation designers, we design the best we can. We follow the steps in the conversation design workflow: canvassing user needs, writing sample dialogues, and validating through Wizard-of-Oz testing. We take that output and build the best assistant we can. Then it's time to go live. We want real users to interact with it, and we collect all sorts of data that hopefully helps us improve the experience. But which metrics for your chatbot or voice assistant should you be looking at?
There are many things you can measure in a chatbot or voice assistant, and which ones to focus on differs per team and organization. A customer service team might care most about call deflection, while marketing bots are more interested in engagement and sales bots in conversion rates.
Let’s go over some useful metrics that you can track to create better conversational experiences.
1. Number of chats per day
It is what it says: this metric tracks how many chats there are per day. Pretty straightforward. Whether you want this number to go up depends on the deployment.
When people reach out more often to a customer service chatbot, that could mean things on the website are unclear. On the other hand, if you run a quiz on the Google Assistant, you want this number to go up, because it means more people are enjoying the experience. And if your chatbot is in charge of lead generation, you obviously want to see this number rise as well.
Keep a close eye on the number of chats per day, but have a clear understanding of the goal that this particular AI Assistant has.
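Counting chats per day is just an aggregation over session timestamps. A minimal sketch in Python, where the session data and timestamp format are made-up examples:

```python
# Sketch: counting chats per day from session start timestamps.
# The sample timestamps are hypothetical.
from collections import Counter
from datetime import datetime

sessions = ["2024-03-01T09:12:00", "2024-03-01T14:30:00",
            "2024-03-02T10:05:00", "2024-03-02T11:40:00",
            "2024-03-02T16:22:00"]

# Reduce each session start to its calendar date, then count per date.
per_day = Counter(datetime.fromisoformat(ts).date().isoformat() for ts in sessions)

for day, count in sorted(per_day.items()):
    print(day, count)
```

In practice you would pull the timestamps from your platform's chatlog export rather than a hardcoded list.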
2. Average confidence score
This metric tells you how confident the AI is about its own recognition. Your users say something, and the assistant has to figure out what they mean. It matches the input against its model and then presents an answer. The confidence score tells you how confident the assistant is that it properly understood the user.
The confidence score gives a good first impression of the model, but it doesn't always tell the full story. It's like asking employees to evaluate themselves. For example, a confidence score of 50% doesn't tell you much on its own. Is it at 50% because people are asking the assistant all kinds of weird things that are totally out of scope, or because the model hasn't been properly trained and needs lots of improvement?
A next step in figuring this out is to look at both the internal and external accuracy of your NLU model.
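One quick way to get past the average is to bucket messages by their confidence score and read the low buckets. A rough sketch, where the log format and the threshold values are assumptions rather than any platform's defaults:

```python
# Sketch: bucketing chat messages by NLU confidence to see where a
# low average comes from. Log format and thresholds are assumptions.
messages = [
    {"text": "where is my order", "confidence": 0.92},
    {"text": "can I pay with crypto", "confidence": 0.55},
    {"text": "sing me a song", "confidence": 0.12},  # likely out of scope
]

avg = sum(m["confidence"] for m in messages) / len(messages)

# Very low confidence often means out of scope; middling confidence
# often means the intent exists but is undertrained.
low = [m for m in messages if m["confidence"] < 0.40]
mid = [m for m in messages if 0.40 <= m["confidence"] < 0.70]

print(f"average confidence: {avg:.2f}")
print("likely out of scope:", [m["text"] for m in low])
print("candidates for extra training:", [m["text"] for m in mid])
```

Reading the actual texts in each bucket tells you far more than the single average ever will.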
3. Internal NLU accuracy
When you’ve created a bunch of intents and added training examples, you can test those examples against each other. You do this via k-fold testing: you divide your training set into ‘k’ subsets (folds), then train on all folds but one and test on the remaining fold, rotating through all folds.
This shows you whether the training examples for each intent form a homogeneous set, or whether there’s confusion between intents because some of their training examples are too similar.
This is especially useful before you have launched your AI assistant, because you don’t have any ‘real’ chatlogs yet. But even after launch, it’s good to run a k-fold test after every (big) change. Better yet, run it right before you take a change live, to check whether it improves or hurts your accuracy.
K-fold testing results in these metrics: precision, recall, F1 score, and the percentage of true positives. We won’t dive into them right now, but know that together they form a comprehensive insight into the robustness of your training set.
K-fold also produces a confusion matrix, which shows these metrics for each intent and reveals whether other intents are ‘stealing’ its training examples. It’s up to your AI Trainer to determine whether that happens because one intent is undertrained, another is overtrained, the intents are too close to each other, or the example was wrongly assigned in the first place. A confusion matrix pinpoints exactly where the issues are in your training set and gives actionable insights.
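Precision, recall, and F1 for each intent can be read straight off the confusion matrix. A plain-Python sketch with toy counts (the intent names and numbers are invented for illustration):

```python
# Sketch: per-intent precision, recall, and F1 from a toy confusion
# matrix. Rows are the true intent, columns the predicted intent.
confusion = {
    "order_status": {"order_status": 40, "cancel_order": 5},
    "cancel_order": {"order_status": 8, "cancel_order": 30},
}

def metrics(confusion, intent):
    tp = confusion[intent].get(intent, 0)                       # true positives
    fn = sum(n for pred, n in confusion[intent].items()
             if pred != intent)                                 # missed examples
    fp = sum(row.get(intent, 0) for true, row in confusion.items()
             if true != intent)                                 # stolen examples
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = metrics(confusion, "order_status")
print(f"order_status: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here the 8 `cancel_order` examples predicted as `order_status` are exactly the ‘stolen’ examples the text describes: they lower `order_status`’s precision and `cancel_order`’s recall.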
4. External NLU accuracy
Even when you have a perfectly trained set of intents, the way your customers ask questions changes all the time. The best way to know performance beforehand is by labeling real-life user questions – just as you find them in your chatlogs – and using this ‘blind test’ set to test your assistant. By labeling, we mean that you assign each user question to the intent it should match. Or, when a question is out of scope, you label it as such.
You want to keep these labeled questions separate from your training set. Maybe even separate from the AI Trainers, so they’re not influenced to adapt the training set to fit the test.
Assume that you need at least 1,000 labeled questions per version of your blind test set. Know, too, that you will have to create new versions of this set periodically: every time you add, remove, widen, or narrow (the scope of) intents, you will have to go back and update all labeled questions. That makes this a labor-intensive process, usually reserved for a mature AI Assistant with a big scope and team.
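Scoring a blind test set is then a matter of comparing the assistant's prediction with the label for every question, treating low-confidence answers as out of scope. A sketch where `predict` is a stand-in for a real NLU call, and all data and thresholds are hypothetical:

```python
# Sketch: scoring a labeled blind test set. `labeled` is the
# hand-labeled set; `predict` is a placeholder for a real NLU call.
labeled = [
    ("where is my package", "order_status"),
    ("cancel my subscription", "cancel_order"),
    ("tell me a joke", None),  # labeled out of scope
]

def predict(text):
    # Placeholder for your NLU platform; returns (intent, confidence).
    fake = {"where is my package": ("order_status", 0.91),
            "cancel my subscription": ("order_status", 0.48),
            "tell me a joke": (None, 0.10)}
    return fake[text]

THRESHOLD = 0.60  # below this, treat the prediction as out of scope

correct = 0
for text, expected in labeled:
    intent, conf = predict(text)
    if conf < THRESHOLD:
        intent = None
    correct += intent == expected

print(f"external accuracy: {correct}/{len(labeled)}")
```

Because the blind set stays separate from training, this number reflects how the assistant handles questions the way customers actually phrase them.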
5. Handover percentage
The handover percentage tells you how often a conversation has been handed over to a live channel. In general, we want simple and repetitive tasks to be handled by the bot, while more complex queries are still handled by live agents.
Which number is acceptable here depends on the scope of your project. When you are dealing with complex products, you will accept a higher handover percentage. When you are dealing with very simple and straightforward products, you might want to see a containment of 80%.
6. Containment rate
Many enterprises focus on containment: they want to keep people in the chatbot. When people reach out to the AI Assistant, the goal should be to resolve the matter on that channel. A high containment rate is what we aim for, and most companies like it to be around 80%. However, as AI Assistants become more transactional and conversations get longer, you can expect the containment rate to drop over time.
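Handover percentage and containment rate come from the same conversation outcomes, so they can be computed together. A sketch where the outcome labels are assumptions (your platform's labels will differ, and some definitions treat abandoned chats separately):

```python
# Sketch: handover percentage and containment rate from conversation
# outcomes. Labels are hypothetical; here every non-handover chat
# counts as contained.
outcomes = ["resolved", "resolved", "handover", "resolved", "abandoned",
            "resolved", "handover", "resolved", "resolved", "resolved"]

total = len(outcomes)
handover_pct = outcomes.count("handover") / total * 100
containment = (total - outcomes.count("handover")) / total * 100

print(f"handover: {handover_pct:.0f}%  containment: {containment:.0f}%")
```

Whether an abandoned conversation counts as contained is a definition your team has to agree on before comparing numbers across periods.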
7. Customer satisfaction
Every company is interested in the customer satisfaction score. It is an important metric for your chatbot or voice assistant. However, it’s not always easy to measure. People tend to only answer a question about satisfaction when they are not satisfied.
The best way to understand customer satisfaction is by asking about it in a conversational way. Don’t just ask for a thumbs up or thumbs down. Instead, at the end of the conversation, really ask users how they appreciated the service, whether they are satisfied with the experience, and whether they would recommend it to others. This gives you a much better understanding of whether people are actually enjoying your chatbot or voice assistant.
8. Number of live engagements
This is the number of live conversations. These can be by phone or live chat. This number isn’t tracked within your conversational platform, but it is an important metric to evaluate the overall effectiveness of your bot.
Most companies launch a bot to reduce the number of live engagements. Live engagements are much more expensive than API calls, so they want to automate as many as possible. You want to understand how your AI Assistant influences your live engagements: ideally, this number goes down. But at the same time, expect the average handling time to go up.
9. Average handling time
Average handling time is a standard metric in the contact center world. It tells you the average time a live agent spends with a customer trying to resolve an issue. When you deploy a customer service chatbot, you’re automating simple and repetitive tasks and questions; the more complex issues are still handed over to live agents. Complex issues usually result in longer conversations, which means you will see the average handling time in the service center go up.
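That effect is easy to see with a toy calculation: once the bot absorbs the short, simple conversations, only the long ones reach agents, so the average rises even though agents handle fewer chats. The durations below are invented for illustration:

```python
# Sketch: average handling time (AHT) in seconds, before and after a
# bot deployment. All numbers are made up for illustration.
durations_before_bot = [180, 240, 150, 200, 170]  # simple and complex mixed
durations_after_bot = [420, 380, 450]             # only complex issues remain

def aht(durations):
    return sum(durations) / len(durations)

print(f"AHT before bot: {aht(durations_before_bot):.0f}s")
print(f"AHT after bot:  {aht(durations_after_bot):.0f}s")
```

A rising AHT after launch is therefore not necessarily bad news; read it alongside the falling number of live engagements.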
There are many things to track. Data is everywhere. Every interaction people have with your AI Assistant produces data. As a team, you want to use this data to improve the performance of your chatbot or voice assistant. However, it’s important to understand that the interpretation of metrics is a whole discipline in and of itself.
Most of the time, the AI Trainer specializes in understanding this. They get the nuances behind a certain number and know which decisions were made during the implementation and training of the model. Keep track of data. Use it to improve the operation, but don’t just follow along blindly. Always double-check what’s going on. It usually isn’t as simple as you would like it to be.
P.S. There are more metrics that teams follow. Some of them are useful and some of them are not. If you feel strongly about certain metrics, just shoot us a message. Let’s turn this post into a living document.