Automatic Foreign Language Subtitling

I had this cool idea recently and figured it's worth mentioning here.
A strange but common issue with films is that a scene will sometimes feature characters speaking a language foreign to the film's primary language. This is often done intentionally, for thematic effect and to obscure what they are saying; however, it isn't always clear whether it is intentional, and there can be extended stretches of dialogue that leave the viewer confused.

Most importantly, this audio often isn't represented in subtitle files at all, and when it is, leaving subtitles on during dialogue the viewer already understands is distracting.

The idea is to intelligently display subtitles only while a foreign language is being spoken.

To achieve this, one could:

  • Track the film's primary language via metadata
  • Train a neural network on spoken-language classification
  • Read ahead in the audio and check whether the detected language deviates from the primary one
  • If so, display subtitles only during that period, potentially checking for deviation in the subtitle text as well (see the sketch after this list); in fact, some deeper analysis of different srt files across languages might prove valuable to this prediction
  • Alternatively, offer an option to transcribe with STT, potentially even offline with a model like DeepSpeech, and then feed the result through a translation model like Google Translate
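
To make the gating concrete, here is a minimal sketch of the read-ahead logic in Python. Everything in it is an assumption: classify_language() merely stands in for the trained classifier, and the window and lookahead parameters are arbitrary.

```python
# Hypothetical classifier: stands in for a small NN doing
# spoken-language identification on short audio windows.
def classify_language(audio_window: bytes) -> str:
    """Return an ISO 639-1 code such as 'en' or 'fr' (stub)."""
    raise NotImplementedError

def foreign_spans(windows, primary_lang: str,
                  window_sec: float = 2.0, lookahead: int = 3):
    """Yield (start, end) spans, in seconds, where the detected
    language deviates from the film's primary language.

    A span only opens after `lookahead` consecutive windows disagree
    with the primary language; this is the "read ahead" step.
    """
    span_start = None
    streak = 0
    t = 0.0
    for i, window in enumerate(windows):
        t = i * window_sec
        if classify_language(window) != primary_lang:
            streak += 1
            if streak == lookahead and span_start is None:
                # backdate the span to where the streak began
                span_start = t - (lookahead - 1) * window_sec
        else:
            if span_start is not None:
                yield (span_start, t)
                span_start = None
            streak = 0
    if span_start is not None:
        yield (span_start, t + window_sec)

def show_subtitles_at(t: float, spans) -> bool:
    """Player-side check: display subtitles only inside foreign spans."""
    return any(start <= t < end for start, end in spans)
```

The lookahead requirement is what keeps a single misclassified window (a shout, music, an accent) from flashing subtitles on and off.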

I totally get that this is a non-trivial idea, but I wanted to write it down anyway, since I think it stands out as something truly intelligent that has only really become possible with very recent advances in deep learning and edge hardware.


Or simply use a subtitle file that contains only the relevant dialogue.

Hint:
Forced subtitle
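
In case it helps anyone going this route: when a release does carry a forced track, it can at least be located mechanically, since "forced" is a stream disposition. A minimal sketch, assuming ffprobe is on the PATH:

```python
import json
import subprocess

def forced_subtitle_streams(path: str) -> list:
    """Return the subtitle streams flagged as forced in a media file."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "s",
         "-show_entries",
         "stream=index:stream_tags=language:stream_disposition=forced",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    streams = json.loads(out).get("streams", [])
    return [s for s in streams
            if s.get("disposition", {}).get("forced") == 1]
```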


If only it were that simple. It seems that the existence and reliability of forced subtitles vary greatly between releases, as shown by this 47-page thread attempting to catalogue and describe a methodology for reliably extracting them.

The goal is the same, "Automatic Foreign Language Subtitling"; however, I suspect implementing it in a foolproof way will require some combination of intelligent subtitle suppression (sketched below) and possibly real-time ASR and translation in order to reliably support any media and any language.
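
For the suppression half, a minimal sketch assuming the pysrt package and a list of foreign-language (start, end) spans in seconds, however they were detected:

```python
import pysrt

def suppress_outside_spans(srt_path: str, spans) -> pysrt.SubRipFile:
    """Drop every cue that does not overlap a detected foreign-language span."""
    subs = pysrt.open(srt_path)
    kept = [
        cue for cue in subs
        # SubRipTime.ordinal is the cue time in milliseconds
        if any(cue.start.ordinal / 1000.0 < end and
               cue.end.ordinal / 1000.0 > start
               for start, end in spans)
    ]
    out = pysrt.SubRipFile(items=kept)
    out.clean_indexes()  # renumber the surviving cues
    return out

# e.g. suppress_outside_spans("movie.en.srt", [(512.0, 574.5)]).save("movie.forced.srt")
```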

I am open to suggestions on other approaches to achieve this in a completely automated way. Having worked with DeepSpeech for ASR and with NLP tools for text translation, and given the enormous power of the Apple TV hardware, I don't think this is out of the realm of possibility, albeit a very challenging task whose initial use case appears small. Later on, though, I would fully expect this kind of real-time contextual understanding to be developed for media viewing, to augment the experience in other, unanticipated ways.
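
For reference, the ASR-plus-translation leg is mostly plumbing. A sketch against the published DeepSpeech 0.9.x Python API; the model file names are placeholders, translate() is hypothetical, and DeepSpeech models are single-language, so the language detector would have to pick the matching model:

```python
import numpy as np
import deepspeech

# Placeholder file names; the .pbmm/.scorer files come from the
# DeepSpeech releases and are specific to one language.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe(chunk: np.ndarray) -> str:
    """chunk must be 16 kHz mono int16 PCM, which is what DeepSpeech expects."""
    return model.stt(chunk)

def translate(text: str, target: str = "en") -> str:
    """Hypothetical: any translation backend, cloud or on-device."""
    raise NotImplementedError

def subtitle_for(chunk: np.ndarray) -> str:
    """ASR then MT: turn a chunk of foreign speech into a subtitle line."""
    return translate(transcribe(chunk), target="en")
```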