philip lelyveld The world of entertainment technology


NAB 2018: IBM Watson on Refining AI for Closed Captioning

Kulczar defined the terms used in closed captioning. “It all starts with automated speech recognition (ASR), which creates a text output from an audio track,” he said. “This is not to be confused with speech-to-text, which extends ASR but is only focused on the spoken word.”

Transcription is the manual process of creating text output from audio. “And closed captioning is a service that allows audiences to interpret content accurately without audio to assist with full understanding of the presented content,” he said.

In addition to ASR, closed captioning systems need to recognize and identify common sounds and noises, detect when speakers switch languages, and identify and represent speaker transitions (called diarization). For blind viewers, audio description conveys the visual characteristics of the content. Closed captioning systems also need editing tools to time-stamp caption data accurately, present captions in an intelligent, coherent layout, and provide an interface for manual changes.
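To make the time-stamping and diarization requirements concrete, here is a minimal Python sketch that turns a list of timed, speaker-labeled cues into SubRip-style (SRT) caption text. It is not from the IBM presentation: the `Cue` class, the speaker names and the choice to show a label only when the speaker changes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption cue (hypothetical structure, for illustration only)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str   # diarization label, e.g. "Speaker 1"
    text: str      # spoken words or a sound description such as "[applause]"


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as SRT text, labeling a cue only when the speaker changes."""
    blocks = []
    prev_speaker = None
    for index, cue in enumerate(cues, start=1):
        # Show the speaker label only on a speaker transition (an assumed convention).
        prefix = f"[{cue.speaker}] " if cue.speaker != prev_speaker else ""
        prev_speaker = cue.speaker
        timing = f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}"
        blocks.append(f"{index}\n{timing}\n{prefix}{cue.text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [
        Cue(0.0, 1.5, "Speaker 1", "Hello."),
        Cue(1.5, 3.0, "Speaker 1", "[applause]"),
        Cue(3.0, 4.0, "Speaker 2", "Thanks."),
    ]
    print(to_srt(demo))
```

A real captioning pipeline would also have to handle line length, on-screen positioning and language tags, which this sketch leaves out.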

Today’s closed captioning systems each have pros and cons. Manual (human) captioning is the most accurate for VOD files, but it is also the most expensive. Crowdsourced captioning is more cost-effective, and contributors are likely to have knowledge of most subject areas, but it doesn’t support live content, is unpredictable and is vulnerable to contributors with malicious intent.

Machine-generated captioning is the most cost-effective model, and it’s consistent and trainable, but it still needs further development and carries the up-front costs of training. Hybrid systems offer a “safety net” to cover some of the other systems’ flaws, but don’t get rid of all of them.
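One way to picture a hybrid “safety net” is a simple routing rule: machine captions that the recognizer is confident about go straight out, and the rest are queued for a human editor. The Python sketch below assumes that interpretation. The `MachineCaption` class, the confidence score and the 0.85 threshold are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class MachineCaption:
    """A machine-generated caption with a confidence score (hypothetical)."""
    text: str
    confidence: float  # assumed to range from 0.0 (unsure) to 1.0 (certain)


def route(captions: list[MachineCaption], threshold: float = 0.85):
    """Split captions into auto-published and human-review lists.

    The threshold is an illustrative value, not a recommended setting.
    """
    auto_publish, needs_review = [], []
    for caption in captions:
        if caption.confidence >= threshold:
            auto_publish.append(caption)
        else:
            needs_review.append(caption)
    return auto_publish, needs_review


if __name__ == "__main__":
    batch = [
        MachineCaption("Welcome to the show.", 0.97),
        MachineCaption("Our guest is Dr. Nguyen.", 0.62),
    ]
    published, queued = route(batch)
    print(len(published), "published;", len(queued), "sent for review")
```

Raising the threshold sends more captions to human editors, which trades cost for accuracy. Captions the machine gets wrong with high confidence still slip through, which fits the point that a hybrid approach does not remove every flaw.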
