
Pulling text out of a (mediasoup) hat and other magic tricks for a captivating live transcription performance

Live transcription is a practical and highly requested feature for video chat platforms. Transcription helps people who are hard of hearing, helps meet national and institutional accessibility regulations, lets participants quickly refer back to earlier parts of a call, and more. Historically, transcription services have been expensive, unreliable, or unavailable for general use.

At Daily, our primary focus is delivering the best time to value for developers working with WebRTC audio and video. But what about turning audio and video into text? This talk details the planning and design work that went into optimizing live transcription for ease of use for our customers.

This talk will address the following questions:

  • Why did we decide to partner with a service provider (Deepgram), and how did we work together to deliver high-quality, AI-powered transcripts of Daily calls?
  • What approaches did we take to support additional, provider-agnostic audio transcription services for users who want to Bring Their Own Transcription?
  • What are all the technical magic tricks that went into pulling RTP audio streams out of mediasoup, packaging different streams into an easily digestible LPCM audio stream, sending audio over WebSockets, accepting the transcription output, and sending it back to clients via our signaling WebSocket? (A rough sketch of this pipeline follows below.)
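
To make that last pipeline more concrete, here is a minimal sketch of one way to lift a single mediasoup audio producer out of the router, decode it to 16-bit linear PCM with FFmpeg, and stream the PCM to Deepgram's live WebSocket endpoint. This is an illustrative assumption, not Daily's actual implementation: the ports, codec settings, and the forwardTranscriptToClients() helper are made up for the example, and the Deepgram query parameters and response shape should be checked against their current docs.

```typescript
import { spawn } from 'child_process';
import * as fs from 'fs';
import WebSocket from 'ws';
import type { types as ms } from 'mediasoup';

async function transcribeProducer(router: ms.Router, audioProducer: ms.Producer) {
  // 1. Create a PlainTransport that pushes RTP to a local port FFmpeg will read from.
  const transport = await router.createPlainTransport({
    listenIp: '127.0.0.1',
    rtcpMux: false,
    comedia: false,
  });
  await transport.connect({ ip: '127.0.0.1', port: 5004, rtcpPort: 5005 });

  // 2. Consume the participant's audio producer onto that transport (paused until we're ready).
  const consumer = await transport.consume({
    producerId: audioProducer.id,
    rtpCapabilities: router.rtpCapabilities,
    paused: true,
  });

  // 3. Describe the RTP stream to FFmpeg via a small SDP file, then decode
  //    Opus -> 16 kHz mono s16le PCM written to stdout.
  const { payloadType, clockRate, channels } = consumer.rtpParameters.codecs[0];
  const sdp = [
    'v=0',
    'o=- 0 0 IN IP4 127.0.0.1',
    's=mediasoup-audio',
    'c=IN IP4 127.0.0.1',
    't=0 0',
    `m=audio 5004 RTP/AVP ${payloadType}`,
    `a=rtpmap:${payloadType} opus/${clockRate}/${channels}`,
    'a=recvonly',
  ].join('\n');
  fs.writeFileSync('/tmp/audio.sdp', sdp);

  const ffmpeg = spawn('ffmpeg', [
    '-protocol_whitelist', 'file,udp,rtp',
    '-i', '/tmp/audio.sdp',
    '-f', 's16le', '-acodec', 'pcm_s16le', '-ar', '16000', '-ac', '1',
    'pipe:1',
  ]);

  // 4. Stream the raw PCM to Deepgram's live endpoint and relay transcript text back out.
  const dg = new WebSocket(
    'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000&channels=1',
    { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
  );

  dg.on('open', async () => {
    await consumer.resume(); // start RTP flowing toward FFmpeg
    ffmpeg.stdout.on('data', (pcm: Buffer) => dg.send(pcm));
  });

  dg.on('message', (msg) => {
    const result = JSON.parse(msg.toString());
    const text = result.channel?.alternatives?.[0]?.transcript;
    if (text) forwardTranscriptToClients(text); // hypothetical relay over the signaling WebSocket
  });
}

// Hypothetical stand-in for whatever pushes transcript events to clients over signaling.
function forwardTranscriptToClients(text: string) {
  console.log('transcript:', text);
}
```

A production version of this per-call pipeline would also need to handle reconnects, mix or tag multiple participants' streams, and tear down the transport and FFmpeg process when the producer closes; the talk walks through how those pieces fit together.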