The team behind this feature has now explained how it was built. Google says the feature uses recent advances in on-device machine learning to transcribe speech, recognise audio events, suggest tags for titles, and help users navigate transcripts.
Google’s speaker diarization system
Speaker Labels are powered by Turn-to-Diarize, Google’s new speaker diarization system. Speaker diarization is the process of partitioning an input audio stream into segments according to speaker identity. Google’s system has three main components.
- The first is ‘speaker turn detection’, which detects a change of speaker in the input speech. It converts the acoustic features into text transcripts that are augmented with a special token marking each speaker turn.
- The second is the ‘speaker encoder model’ that extracts voice characteristics from each speaker turn. “Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector to represent the voice characteristics of each speaker turn,” the company said.
- The third is a ‘multi-stage clustering algorithm’, which determines whether there are at least two different speakers in the recording and then assigns a speaker label to each speaker turn.
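To make the three stages concrete, here is a minimal Python sketch of how they could fit together. It is an illustration only, not Google’s implementation: the `<st>` turn marker, the hand-written two-dimensional embeddings, the similarity threshold, and the simple greedy clustering (standing in for the multi-stage algorithm) are all assumptions made for this example.

```python
import math

TURN_TOKEN = "<st>"  # hypothetical marker for a speaker turn


def split_turns(transcript):
    """Stage 1 (sketch): split a token-augmented transcript into turns."""
    return [t.strip() for t in transcript.split(TURN_TOKEN) if t.strip()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_turns(embeddings, threshold=0.8):
    """Stage 3 (sketch): greedy clustering of per-turn embeddings.

    Each turn joins the most similar existing speaker if the similarity
    reaches the threshold; otherwise it opens a new speaker. This is a
    simple stand-in, not the multi-stage algorithm from the article.
    """
    sums = []    # running sum of embeddings per speaker
    labels = []  # speaker index assigned to each turn
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, total in enumerate(sums):
            sim = cosine(emb, total)  # cosine ignores scale, so a sum works
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            sums.append(list(emb))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [s + e for s, e in zip(sums[best], emb)]
            labels.append(best)
    return labels


turns = split_turns("hello how are you <st> fine thanks <st> good to hear")
# Stage 2 would produce these from audio; here they are made-up values.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
labels = cluster_turns(embeddings)
for text, label in zip(turns, labels):
    print(f"Speaker {label + 1}: {text}")
```

In this toy run the first and third turns have similar embeddings, so they end up under the same speaker, while the second turn is attributed to a different one.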
Correction and Customisation
The Recorder app also makes corrections in real time, automatically updating the speaker labels on the screen to reflect the most accurate predictions. “As the model consumes more audio input, it accumulates confidence on predicted speaker labels, and may occasionally make corrections to previously predicted low-confidence speaker labels,” Google said.
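The correction behaviour Google describes can be pictured with a short Python sketch. This is an assumption-laden illustration rather than the app’s actual logic: the function name `apply_predictions`, the `(label, confidence)` pairs, and the 0.6 low-confidence cut-off are all hypothetical.

```python
def apply_predictions(displayed, predictions, low_conf=0.6):
    """Update on-screen labels with new predictions (illustrative only).

    displayed and predictions map a turn index to (label, confidence).
    A label already shown is replaced only if its stored confidence is
    low and the new prediction is more confident. Returns the indices of
    turns whose label was corrected.
    """
    corrected = []
    for turn, (label, conf) in predictions.items():
        old = displayed.get(turn)
        if old is None:
            displayed[turn] = (label, conf)  # first prediction for this turn
        elif old[0] == label:
            displayed[turn] = (label, max(old[1], conf))  # confidence accumulates
        elif old[1] < low_conf and conf > old[1]:
            displayed[turn] = (label, conf)  # correct a low-confidence label
            corrected.append(turn)
    return corrected


displayed = {}
first = apply_predictions(displayed, {0: ("Speaker 1", 0.9), 1: ("Speaker 1", 0.4)})
# Later, with more audio, the (hypothetical) model revises its predictions.
second = apply_predictions(displayed, {0: ("Speaker 2", 0.95), 1: ("Speaker 2", 0.8)})
print(second, displayed)
```

Here only the second turn is relabelled, because its earlier prediction was low-confidence; the first turn keeps its original high-confidence label.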