As an important milestone in our work toward emulating this ability in AI systems, we are sharing a new framework called Generalized Data Transformations. It achieves state-of-the-art performance in understanding the content of videos without using any labeled training data.
Rather than relying on a single modality, the framework learns from the relationships between the sounds and images in a video.
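To make this concrete: one common way to learn from audio-visual correspondence is a contrastive objective, where embeddings of the video frames and the audio track from the same clip are pulled together, while embeddings from different clips are pushed apart. The sketch below is a minimal, illustrative InfoNCE-style cross-modal loss in NumPy; it is an assumption about the general technique, not the paper's actual Generalized Data Transformations implementation, and all function names here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_nce_loss(video_emb, audio_emb, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss between two modalities.

    video_emb, audio_emb: (N, D) arrays where row i of each comes from the
    same clip. Matching (video, audio) pairs sit on the diagonal of the
    similarity matrix and are treated as positives; all other pairs in the
    batch act as negatives.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature            # (N, N) cosine similarities
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(v))
    # Maximize the probability of the matching pair on the diagonal.
    return -log_probs[idx, idx].mean()

# Toy usage: perfectly aligned embeddings yield a much lower loss
# than unrelated random ones.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
loss_aligned = cross_modal_nce_loss(video, video)
loss_random = cross_modal_nce_loss(video, rng.normal(size=(8, 16)))
```

In practice the embeddings would come from a video encoder and an audio encoder trained jointly; the key idea is that the "labels" are free, since the pairing of sound and image within each clip supplies the supervision.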
Why it matters
To build truly intelligent machines, we must enable them to learn directly from the world without explicit guidance at every step. Learning from the sights and sounds of the world as they occur, without supervision, is the hallmark of this ability.
Read the full paper:
https://arxiv.org/pdf/2003.04298.pdf
Read the full story here: https://ai.facebook.com/blog/a-state-of-the-art-self-supervised-framework-for-video-understanding/