The theme for Kranky Geek 2018 last Friday was AI in communications and, judging by the record-setting attendance, this is a hot topic. Despite the air quality alert that was posted in San Francisco, over 250 developers turned out for the annual WebRTC developer convocation. It was standing room only in the Google auditorium.
We were excited to launch our new AI-powered analytics at the conference during a presentation by our CEO, Varun Singh. Check out our announcement blog post here, and a video of Varun’s presentation here. Now, for our summary of the many additional topics discussed during the day by Microsoft, IBM, Facebook, Google and more. All of the presentations were captured on video and are available here.
Google - Eye-popping WebRTC Growth
Google representatives released some startling growth numbers for usage of WebRTC on Chrome, which is a window into only a fraction of the ecosystem. Google said combined video and audio minutes grew 50% over the past year to 2.3B minutes/week, and data channel usage grew 76% to 3 PB/week. Google estimates the total ecosystem is at least 10x larger than this!
The technical team announced enhancements to Chrome that will minimize the effects of echo, one of the most difficult problems in communications. Acoustic Echo Cancellation (AEC) v3 separates raw audio device access and processing from the browser and application process. By running AEC v3 as its own process, Chrome improves echo performance and makes the browser more stable and reliable.
The code is currently being tested on desktop platforms and is targeted for release in Chrome M71/M72.
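AEC v3 ships inside Chrome and requires no application changes, but applications still opt in to echo cancellation through the standard getUserMedia constraints. A minimal sketch (the helper name is ours; the constraint keys are the standard WebRTC ones):

```javascript
// Build MediaStreamConstraints that request the browser's echo canceller.
// The helper name is ours; the constraint keys are standard WebRTC.
function buildAudioConstraints({ echoCancellation = true } = {}) {
  return {
    audio: {
      echoCancellation,        // let Chrome's AEC process the mic signal
      noiseSuppression: true,  // companion audio processing
      autoGainControl: true,
    },
    video: false,
  };
}

// In a browser:
// navigator.mediaDevices.getUserMedia(buildAudioConstraints())
//   .then((stream) => { /* attach the stream to a peer connection */ });
```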
Google isn’t resting on its WebRTC laurels: it disclosed work on WebRTC Next Version (NV), which includes new low-level APIs for networking, direct media control via worklets and workers, support for the QUIC protocol for media and arbitrary data, and the AV1 codec. See our CommCon and TPAC articles for more information. Meanwhile, it remains committed to completing v1.0 in H1/2019, including Unified Plan, WebRTC stats, and simulcast.
Microsoft Universal Windows Platform - Integration with callstats.io
As it reached the culmination of its four-year effort to add a WebRTC interface to the Universal Windows Platform (UWP), Microsoft announced it has taken steps to make it easy for developers to integrate with callstats.io. It will provide a sample based on the peer connection application that developers can use as a model for their own apps. The immediate targets are AR/VR apps on HoloLens and gaming apps on Xbox, but others are on the horizon.
UWP is the native platform for Windows 10 and runs on Xbox, HoloLens, Mobile and Desktop. Microsoft is enabling application developers to build WebRTC apps on the UWP platform by implementing a comprehensive port of Google’s Win32 implementation of WebRTC. This gives UWP all the capabilities of Google Chrome plus optimizations for UWP, including support for low power devices and hardware-based AEC/AGC.
Microsoft announced it is working with Google to upstream its UWP code back into the WebRTC.org repo via the public submission process.
Voicebase - AI for Measuring Sentiment in Contact Center Calls
Voicebase is using AI to perform paralinguistic analysis of voice conversations, which measures “how something is said instead of what is said.” Its algorithms are built upon a low-level quantization of the frequencies, volume and start/stop times of words. With these metrics, Voicebase can analyze the pace of conversation, including overlaps and silence. It can also analyze pitch and energy.
By rolling this data up at the call level, Voicebase extracts a meaningful determination of whether a call to a contact center agent was successful or not. The speaker, Jeff Shukis, gave an example: if the caller’s pace of conversation is slower during the last third of the call than it was during the first third, the call was likely successful; if the opposite is true, it was not. Similar ratios are calculated for pitch and energy to provide a comprehensive sentiment analysis.
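As a rough illustration of that pace comparison (our own sketch, not Voicebase’s actual algorithm), one could compare the number of words spoken in the first and last thirds of a call:

```javascript
// Toy pace comparison: given the start time (in seconds) of each
// recognized word, compare the caller's pace in the last third of the
// call against the first third. A ratio below 1 means the caller
// slowed down, which the talk suggests correlates with a successful call.
function paceRatio(wordStartTimes, callDuration) {
  const third = callDuration / 3;
  const countIn = (lo, hi) =>
    wordStartTimes.filter((t) => t >= lo && t < hi).length;
  const firstThird = countIn(0, third);
  const lastThird = countIn(2 * third, callDuration);
  return lastThird / firstThird;
}

// A caller who spoke 4 words in the first 10s and 2 words in the last
// 10s of a 30s call gets a ratio of 0.5, i.e. the caller slowed down.
paceRatio([0, 1, 2, 3, 12, 21, 27], 30); // → 0.5
```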
Shukis said the combination of transcript and paralinguistic analysis provides the most reliable indication of the outcome of a call. Voicebase makes this analysis accessible via VoiceBase Query Language.
IBM Watson - AI-powered Contact Center Voice Bots
IBM is using its Watson platform to develop voice-driven bots that are capable of independently handling many basic customer interactions. Like a next-generation IVR, voicebots can interact with callers to order a pizza or conduct more complex transactions. They can also assist contact center agents by analyzing the customer conversation in real time and providing suggestions for responses in the background.
Brian Pulito described the AI services driving voicebots, which include speech-to-text and text-to-speech, natural language classification to understand the caller’s intent, and tone and sentiment analysis. This last component seems to be IBM’s differentiator, because Brian explained it is the key to making the interaction sound natural to the caller and keeping the call from being escalated to a live agent.
The IBM architecture is comprehensive, with call control for SIP and WebRTC networks, interfaces to speech services over WSS and MRCP, REST APIs connecting a full complement of AI-driven services, and SQL and REST access to call recordings, CDRs and other KPIs.
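The voicebot turn loop described above can be sketched as follows. This is our own hypothetical illustration; the function and service names are ours, not IBM Watson API calls:

```javascript
// Hypothetical voicebot turn: transcribe a caller utterance, classify
// intent, check tone, and either reply or escalate to a live agent.
// All service names here are placeholders, not real Watson APIs.
function handleTurn(callerAudio, services) {
  const text = services.speechToText(callerAudio);
  const intent = services.classifyIntent(text);   // e.g. "order_pizza"
  const tone = services.analyzeTone(text);        // e.g. "frustrated"

  // Tone analysis acts as the escalation gate: hand off to a human
  // before the caller gets annoyed with the bot.
  if (tone === "frustrated") {
    return { action: "escalate" };
  }
  const reply = services.respondTo(intent);
  return { action: "speak", audio: services.textToSpeech(reply) };
}
```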
Dialpad - AI-based Advancements Put Proper Names into Speech Recognition
One of the biggest challenges in the speech recognition field is accurately recognizing proper names. Etienne Manderscheid of Dialpad claims the company has developed a solution that is tied with Google for the leading rank in speech recognition accuracy. This is an important advancement that improves the accuracy of contact center analytics, because product names are critical keywords used in measuring customer satisfaction. The solution also improves the accuracy of transcriptions.
The Dialpad solution, Domain Adaptation, is built from an acoustic model, a lexicon model, and a language model. Sound goes through a Kaldi-based acoustic model to create phonemes. The phonemes are passed through the lexicon model to produce words. He recommends using the CMU Pronouncing Dictionary and adding words to cover pronunciations from within a domain (e.g. customer service). He used the example of creating dictionary entries for “cranky geek” as well as “kranky geek.”
The key is to contextualize the way words are used in statements. By providing word context, the AI can distinguish between “I will be going to the kranky geek events in San Francisco and Bangalore this year” and “I met a cranky geek yesterday.”
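A toy illustration of that idea (our sketch, not Dialpad’s model): a CMU-dict-style lexicon extended with the in-domain spelling “kranky,” plus bigram counts that let a language model pick the right spelling from surrounding words. All counts and names here are invented:

```javascript
// Both spellings share one pronunciation, so the acoustic model alone
// cannot tell them apart; only context can.
const lexicon = {
  cranky: "K R AE NG K IY",  // standard dictionary-style entry
  kranky: "K R AE NG K IY",  // added domain word, same phonemes
  geek: "G IY K",
};

// Invented bigram counts, as they might look after adapting a language
// model to in-domain text where the event name is common.
const bigramCounts = {
  "kranky geek": 12, // the event name dominates in this domain
  "cranky geek": 2,
  "a cranky": 15,    // "a cranky ..." is ordinary English usage
  "the kranky": 8,
};

// Score each candidate spelling by how often it appears next to its
// neighbors, and keep the higher-scoring one.
function disambiguate(prev, next) {
  const score = (w) =>
    (bigramCounts[`${prev} ${w}`] || 0) + (bigramCounts[`${w} ${next}`] || 0);
  return score("kranky") > score("cranky") ? "kranky" : "cranky";
}

disambiguate("the", "geek"); // → "kranky" (8 + 12 vs 0 + 2)
disambiguate("a", "geek");   // → "cranky" (0 + 12 vs 15 + 2)
```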
We think this development can improve transcription in jargon-heavy domains.
Facebook Portal - AI to Control Real-time Camera Movement
Facebook described how it is using AI to control movement in the smart camera that is part of its new Portal product. The camera uses a 140-degree wide-angle lens and a 4K sensor to capture all the potential subjects in a scene and processes them with algorithms onboard the camera. The algorithms perform what Facebook calls “scene understanding,” which allows the camera to select a single focal point from among multiple people in the scene and zoom in the appropriate amount to give the party on the other end of the call an exceptional experience.
The video demo was very impressive and the technical explanation even more so. The camera uses a mobile device chipset to process the algorithms in real time. It learns as it gathers more information and has the capability to smoothly recover from wrong guesses about what is happening in a scene.
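As a toy sketch of the framing step (our illustration, not Facebook’s algorithm), one could take the person bounding boxes produced by scene understanding and derive a crop center and a zoom factor relative to the full wide-angle frame:

```javascript
// Given person bounding boxes (x, y, w, h in pixels) detected in the
// full wide-angle frame, compute a crop center and a capped zoom factor
// that keeps the whole group in view with some margin.
function frameSubjects(boxes, frameWidth, frameHeight) {
  const left = Math.min(...boxes.map((b) => b.x));
  const top = Math.min(...boxes.map((b) => b.y));
  const right = Math.max(...boxes.map((b) => b.x + b.w));
  const bottom = Math.max(...boxes.map((b) => b.y + b.h));

  const pad = 0.1; // keep a 10% margin around the group
  const cropW = (right - left) * (1 + 2 * pad);
  const cropH = (bottom - top) * (1 + 2 * pad);

  return {
    centerX: (left + right) / 2,
    centerY: (top + bottom) / 2,
    // Zoom so the crop fills the frame, capped to avoid extreme close-ups.
    zoom: Math.min(frameWidth / cropW, frameHeight / cropH, 4),
  };
}
```

A real implementation would also smooth these values over time so the virtual camera glides rather than jumps, which is part of what made the demo impressive.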
Interested in working with our team and attending some fun real-time communications events? Check out our careers page.