The Hidden Sound of Trust

Trust isn't just built through words. In voice AI, tone, rhythm, and pronunciation shape every interaction, making acoustic design as important as the conversation itself.

Now that we are completely immersed in the era of Large Language Models and Agentic AI, our industry needs to redefine what 'good' actually looks like. It’s no longer just about checking a box to see if a bot successfully used an API, matched the 'golden' answer, or retrieved the correct data. It’s about comprehensive, continuous testing to ensure these systems respect human social dynamics. Because even in this new era, conversational quality and trust are non-negotiable.

It was exactly this human-first mindset that echoed through the beautiful, historic arcades of the Museu Marítim during the Beyond Boundaries Festival in Barcelona (April 28–30). Inspired by those sessions, I’ve written a series of blog posts capturing the keynotes that best define this new era of AI.

Shauna Griffin

Shauna Griffin, director of product at Rasa, focused her keynote entirely on the acoustic, audio-first side of voice agents. Since I am working on a voice project at the moment, I was all ears.

Shauna shared that right now, as a conversational industry, we spend about 95% of our time focusing on text. We obsess over prompt engineering, fine-tuning, and writing the perfect scripts. But if you are building a voice agent, the text is only half the battle. Let’s zoom into that.

In speech, how something is said matters just as much as what is said. Intonation, rhythm, phrasing, and emphasis carry a lot of social meaning. When we evaluate voice systems through the Conversational Capital framework, Shauna's keynote proved that a text-perfect LLM response can still damage your Trust Ledger if the vocal performance fails.

Here are my main takeaways from this ‘ear-opening’ keynote:

1. Uncanny feeling

A few years back, conversation designers spent hours tweaking words or parts of words just to make sure they were pronounced correctly by the text-to-speech engine. I remember having to spell certain words in the weirdest, most unnatural ways just to trick the system. It was far from efficient. Back then, we couldn’t really choose a custom voice: we were limited to basic gender options and had zero control over the tone.

Shauna dived into the fascinating mechanics of modern, generative Text-to-Speech (TTS) and how we can now actually use prompts to shape vocal style, emotion, and emphasis. It’s no longer just about choosing a generic synthetic voice. It’s about directing a live vocal performance.

In the language of Conversational Capital, poor voice synthesis creates a massive tax on the user's attention, known as cognitive load. If a voice agent delivers very serious, urgent account information in a flat, cheerful but robotic way, the brain senses a mismatch. That weird, uncanny feeling causes skepticism, leading to a direct withdrawal from your corporate Trust Ledger.

2. The phonetic audit

To spot any mismatches in the direction of your vocal performance, Shauna emphasized the necessity of phonetic testing. A sentence might look flawless on a screen, but when a TTS engine pronounces a brand name incorrectly, stresses the wrong syllable, or misreads an abbreviation, the illusion of human-like interaction breaks.

The moment that happens, the user will probably get frustrated, state that they want to speak to a human, and your brand builds massive content debt. The golden rule here? Listen to the actual audio output, don't just read the text logs!
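To make that golden rule a bit more practical, here is a minimal sketch of how a team might decide which utterances to listen to first. It is my own illustration, not something shown in the keynote: the function name `flag_for_listening` and the brand lexicon are hypothetical, and the all-caps pattern is only a rough stand-in for "abbreviation". It does not judge pronunciation at all; it simply flags the risky terms so a human knows where to press play.

```python
import re

# Hypothetical lexicon: names your team knows a TTS engine may mispronounce.
BRAND_TERMS = {"Rasa", "Museu Marítim"}

# Rough heuristic (an assumption): two or more capitals in a row is an abbreviation.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def flag_for_listening(utterance: str, brand_terms=BRAND_TERMS) -> list[str]:
    """Return the terms in an utterance that a reviewer should listen to."""
    # Abbreviations first, in the order they appear in the text.
    flagged = [match.group() for match in ABBREVIATION.finditer(utterance)]

    # Then any brand terms, matched case-insensitively.
    lowered = utterance.lower()
    for term in sorted(brand_terms):
        if term.lower() in lowered and term not in flagged:
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    script = [
        "Your IBAN is linked to your Rasa account.",
        "Thanks for calling.",
    ]
    for line in script:
        print(line, "->", flag_for_listening(line))
```

Anything the sketch flags still needs a pair of human ears on the rendered audio. The list only tells you where to start.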

3. Pick the right voice

A voice can make a customer feel immediately safe, or instantly frustrated. When an enterprise designs or picks a voice that perfectly matches the emotional context of the user's situation (calm and trustworthy for an insurance claim, energetic for a restaurant booking), it acts as a huge capital deposit into the Trust Ledger.

Shauna’s keynote was the perfect reminder that in a voice-first world, your design work isn't finished when the text is finalized.