How Sonos built its voice assistant

Hello and welcome to Protocol Entertainment, your business guide to the gaming and media industries. This Thursday, we explore how Sonos built its voice assistant and why Amazon didn’t use computer vision for its new Glow projector. Also: Time to take a deep breath.

Hey Sonos: The work needed to build the company’s new voice assistant

Sonos quietly began rolling out its voice assistant to select people in the US this week, days ahead of its official June 1 launch date. Sonos Voice Control is purpose-built for music playback, and it comes with strong privacy safeguards: unlike Alexa or the Google Assistant, it doesn’t upload any voice recordings to the cloud and instead processes everything on the device.

I spoke with Sébastien Maury, Senior Director of Voice Experience at Sonos, and Kåre Sjolander, European Head of Text-to-Speech at synthetic voice specialist ReadSpeaker, to learn more about the work involved in building the assistant.

Giving the Sonos assistant a voice. Sonos partnered with ReadSpeaker to create a distinctive voice for its assistant, based on “Breaking Bad” actor Giancarlo Esposito.

  • Esposito spent approximately 40 hours in the studio recording thousands of phrases and expressions, which were then used as training data for the voice model.
  • Much of the recorded material was not music-specific at all. “Basically, we have a basic standard script for English,” Sjolander said. “It’s short sentences, longer sentences, numbers. All different types of materials.”
  • ReadSpeaker also included a good deal of Spanish vocabulary in Esposito’s script to improve the voice model’s pronunciation of Latin artists and songs. “He actually had a Spanish coach while recording,” Maury said.
  • Esposito was also asked to read some material specific to the Sonos assistant, but even those sentences aren’t used 1:1. Instead, it’s all AI fodder. “You’re building a model of the actor’s voice, which should be able to say anything,” Sjolander said.
  • There’s one notable exception to this: when people summon the assistant with the phrase “Hey Sonos” and don’t follow it with a request, they’ll hear Esposito’s real voice say “Yes?”
  • “We wanted to have a very specific intonation for it,” Maury said. “Like someone who’s kinda bored… ‘Okay, come on!’”

Making sure the assistant understands you. Having an assistant that responds with a synthetic voice is only half the battle. Getting it to actually understand requests is just as important, and even harder if it’s done locally on the device.

  • Amazon and Google use cloud-based speech recognition for their respective assistants and actually have humans review small subsets of these recordings to improve accuracy.
  • However, there has been some backlash against this practice, which is why Sonos decided against it. Instead, Sonos uses voice recordings from its opt-in community of beta users to train its assistant.
  • The company has also partnered with external contractors for additional recordings. “We give them a script and we put together [training] data,” Maury said.
  • Sonos plans to continually update this data to account for new artists, oddly pronounced song names, and other edge cases.

Focusing on a single use case makes things a little easier. Sonos Voice Control won’t need to tell people about the weather or where they are going, and speaker owners will likely use a much narrower set of requests.

  • Still, this is no walk in the park. “The music field is actually probably the hardest,” Maury said.
  • Think of all the artists whose name you can’t pronounce, or all the artists, bands and songs that contain the term “Alice”. Somehow the assistant has to make sense of each of them, or people will just give up using it.
  • Sonos has a bit of a superpower at its disposal: the company uses the songs and artists that people have favorited in its app as the default response.
  • Instead of building an all-knowing assistant, the company effectively customizes it for each listener.

“That’s one of the benefits of running locally,” Maury said. “We have a [speech recognition] model per house.”
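
For readers who want to see the idea in miniature: the sketch below shows how an ambiguous request like “Alice” could be settled by a household’s favorites. It is only an illustration, not Sonos’ method. Sonos personalizes the speech recognition model itself, while this example simply breaks ties on text matches. The `Track` type, the `resolve_request` function and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical catalog entry; not a Sonos data structure."""
    title: str
    artist: str


def resolve_request(query, catalog, favorites):
    """Return the first catalog match for a query, preferring favorites.

    Assumption: matching is a plain case-insensitive substring test on
    title and artist. A real assistant works on audio, not on text.
    """
    needle = query.casefold()
    matches = [
        track for track in catalog
        if needle in track.title.casefold() or needle in track.artist.casefold()
    ]
    if not matches:
        return None
    # Favorites win ties; otherwise fall back to catalog order.
    favored = [track for track in matches if track in favorites]
    return (favored or matches)[0]


# Illustrative sample data only.
catalog = [
    Track("Alice", "Avril Lavigne"),
    Track("Living Next Door to Alice", "Smokie"),
    Track("Man in the Box", "Alice in Chains"),
    Track("School's Out", "Alice Cooper"),
]
favorites = {Track("Man in the Box", "Alice in Chains")}

choice = resolve_request("alice", catalog, favorites)
print(f"{choice.title} by {choice.artist}")
```

With an empty set of favorites, the same request falls back to the first match in the catalog, which is the guesswork a one-size-fits-all assistant is stuck with.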

—Janko Roettgers

Computer vision is not a panacea

When I first heard about Amazon’s new video calling device, the Amazon Glow, my mind immediately went to Osmo, which for years has combined digital and physical gaming with its apps and child-focused entertainment accessories. There are even tangram sets for both, allowing kids to solve digital puzzles with physical puzzle pieces.

But after talking to some of the people who worked on the Glow for an in-depth article on its development published this week, I realized that Amazon ultimately decided to take a very different approach, and the reasons behind that decision show that there is no one-size-fits-all approach when it comes to creating next-generation entertainment devices.

  • Osmo uses computer vision to extend play beyond the screen. The company’s hardware includes a small clip-on mirror that redirects an iPad’s front-facing camera view to the table, turning it into a supervised play space.
  • “We looked at Osmo,” acknowledged Martin Aalund, Amazon’s senior hardware engineer and founding member of the Glow team. “They recognize objects with their camera and track those objects.”
  • However, the premise of the Glow went beyond object tracking. “We wanted an interactive screen,” Aalund said. “Actually detecting when a finger touches a surface is much more difficult.”
  • The Glow team looked at several ways to make computer vision work, including using multiple cameras and tracking the shadow of a child’s finger. Nothing really seemed good enough.
  • One problem: the cameras are easily distracted. “If you put [your device] next to a window and there’s a tree outside with swaying branches, and you have shadows moving around the play space, all of a sudden you start detecting all these false positives,” Aalund said.

In the end, Amazon opted for an infrared sensor that could track a person’s fingers instead of a traditional RGB camera. However, Aalund readily admitted that computer vision might one day provide even better results. “We started this five years ago,” he said. “We didn’t have cameras and systems as powerful as today. This confused us a bit.”

—Janko Roettgers

In other news

Magic Leap gets rid of its original headset. The business-to-business pivot is complete: discount site Woot is selling the Magic Leap 1 headset, which originally cost $2,300, for $550 this week.

Netflix is eyeing console and cloud gaming. In a lengthy survey, the company asked subscribers about their interest in playing Netflix games on TV.

Niantic is building an AR map of the world. The Pokémon Go developer has opened up its visual positioning system, which allows developers to create persistent AR experiences in 30,000 locations.

The Netflix layoffs have had a disproportionate impact on people with marginalized identities. A recent round of layoffs has resulted in deep cuts to social media teams set up to speak to people of color and LGBTQ+ viewers.

The Metaverse gets its first world conference. The Meta Festival, scheduled for June 28, will include speakers from Netflix, Headspace, Paramount and others.

Roblox is hiring a former Zynga and Twitter executive. Nick Tornow, the former chief technology officer at Zynga, is joining Roblox as vice president of engineering for its developer team. Tornow was previously Twitter’s head of platform.

The war in Ukraine is still weighing on game development. Belarusian game developer Sad Cat Studio said on Wednesday it was delaying its upcoming Xbox exclusive Replaced to 2023, citing the ongoing conflict and the impact it has had on staff members.

An NFT nightmare: Seth Green made headlines this week when his Bored Ape NFT was stolen and resold to a buyer who has no intention of returning it. This could complicate Green’s plans for an animated TV show using the NFT’s underlying art and character.

Breathe deeply

It’s easy to feel lost and overwhelmed in a week like this. Self-care obviously won’t solve all of our problems (for one thing, it won’t get rid of assault weapons), but taking a moment for yourself can at least help deal with some of the feelings these senseless tragedies leave us with. One way to do this is through guided meditation, which VR meditation company Tripp currently offers for free in its mobile app. Additionally, Tripp recently partnered with Niantic to soon integrate AR experiences into its mobile app, so you can find those moments of self-care anywhere.

—Janko Roettgers

Thoughts, questions, advice? Send them to [email protected]. Have a good day, see you tomorrow.