U.S. Patent No. 11,538,461 - Prepared by Attorney David Tran for Amazon.com, Inc. and filed by Weaver (WAVS IP)
Brief Description: Some implementations include methods for detecting missing subtitles associated with a media presentation and may include receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a recurrent neural network and a convolutional neural network to identify refined speech segments associated with the audio sequence, the recurrent neural network trained based on a plurality of languages, the convolutional neural network trained based on a plurality of categories of sound; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.

This disclosure describes techniques for identifying missing subtitles associated with a media presentation. The media presentation may include an audio component and a subtitle component. The subtitle component may include timestamps associated with subtitles. The techniques may include receiving an audio sequence associated with the audio component. The audio sequence may be divided into a plurality of audio segments of a first duration. For example, the first duration may be 800 milliseconds (ms). Each of the audio segments may be processed using a voice activity detection (VAD) network to determine whether the audio segment is a speech segment. The VAD network may be configured to perform operations associated with a recurrent neural network. The VAD network may be trained to detect speech, based on a plurality of different languages and a plurality of samples, and may therefore be language agnostic.
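
The timestamp comparison described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (split_into_segments, merge_speech_segments, find_missing_subtitles) are hypothetical, the per-segment speech decisions are supplied by the caller in place of the recurrent and convolutional networks, and the rule that a speech interval counts as missing a subtitle when it overlaps no subtitle interval is an assumption made for illustration. Only the 800 ms segment duration comes from the description.

```python
from typing import List, Sequence, Tuple

# An interval is (start_ms, end_ms), with end_ms exclusive.
Interval = Tuple[int, int]


def split_into_segments(total_ms: int, segment_ms: int = 800) -> List[Interval]:
    """Divide an audio sequence of total_ms into fixed-duration segments.

    The last segment is shortened if total_ms is not a multiple of segment_ms.
    """
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


def merge_speech_segments(
    segments: Sequence[Interval], is_speech: Sequence[bool]
) -> List[Interval]:
    """Join adjacent segments flagged as speech into longer speech intervals.

    is_speech stands in for the output of a VAD model (assumption: one
    boolean per segment, in the same order as segments).
    """
    merged: List[Interval] = []
    for (start, end), flag in zip(segments, is_speech):
        if not flag:
            continue
        if merged and merged[-1][1] == start:
            # Extend the previous interval when this segment follows it directly.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def find_missing_subtitles(
    speech: Sequence[Interval], subtitles: Sequence[Interval]
) -> List[Interval]:
    """Return speech intervals that overlap no subtitle interval.

    The "no overlap at all" criterion is an illustrative assumption; a real
    system might require a minimum coverage ratio instead.
    """
    missing: List[Interval] = []
    for s_start, s_end in speech:
        covered = any(
            s_start < sub_end and sub_start < s_end
            for sub_start, sub_end in subtitles
        )
        if not covered:
            missing.append((s_start, s_end))
    return missing
```

For example, a 4,000 ms audio sequence yields five 800 ms segments. If the first, second, and fourth are flagged as speech and the only subtitle spans 100 ms to 1,500 ms, the sketch reports the interval from 2,400 ms to 3,200 ms as speech without a subtitle.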