LipNet

Deep learning model for audio-visual speech recognition


LipNet is a deep neural network for audio-visual speech recognition (AVSR). It was created by University of Oxford researchers Yannis Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. The technique, outlined in a paper in November 2016, decodes text from the movement of a speaker's mouth. Traditional visual speech recognition approaches split the problem into two stages: designing or learning visual features, and prediction. LipNet was the first end-to-end sentence-level lipreading model to learn spatiotemporal visual features and a sequence model simultaneously. Audio-visual speech recognition has enormous practical potential, with applications such as improved hearing aids, support for the recovery and wellbeing of critically ill patients, and speech recognition in noisy environments, as implemented, for example, in Nvidia's autonomous vehicles.
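The article does not describe how a sequence model's per-frame outputs become text. The LipNet paper trains its model with connectionist temporal classification (CTC), and the sketch below illustrates greedy CTC decoding only: take the best symbol for each video frame, collapse consecutive repeats, then drop blanks. The function name `greedy_ctc_decode`, the blank symbol `_`, and the toy alphabet and scores are hypothetical and chosen for illustration. This is not the authors' implementation.

```python
from typing import List, Optional, Sequence

BLANK = "_"  # hypothetical blank symbol, assumed for this sketch


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    blank: str = BLANK,
) -> str:
    """Decode per-frame scores into text using the greedy CTC rule.

    frame_scores[t][k] is the score of alphabet[k] at frame t.
    """
    # Step 1: choose the highest-scoring symbol for every frame.
    best_per_frame: List[str] = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]

    # Step 2: collapse consecutive repeats, then drop blank symbols.
    decoded: List[str] = []
    previous: Optional[str] = None
    for symbol in best_per_frame:
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


# Toy example: six frames over the alphabet (blank, "a", "b").
toy_alphabet = ["_", "a", "b"]
toy_scores = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates the two "a" symbols
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.1, 0.8],  # b (repeat, collapsed)
]
result = greedy_ctc_decode(toy_scores, toy_alphabet)
```

The blank symbol is what lets the decoder emit the same character twice in a row: without the blank frame between them, the two "a" runs would collapse into one.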

References

Assael, Yannis M.; Shillingford, Brendan; Whiteson, Shimon; de Freitas, Nando (December 16, 2016). "LipNet: End-to-End Sentence-level Lipreading". arXiv:1611.01599.
"AI that lip-reads 'better than humans'". BBC News. November 8, 2016.
"Home Elementor". Liopa.
Vincent, James (November 7, 2016). "Can deep learning help solve lip reading?". The Verge.
Quach, Katyanna. "Revealed: How Nvidia's 'backseat driver' AI learned to read lips". The Register.
