Meta’s AI researchers have released a new model that’s trained in a similar way to today’s large language models, but instead of learning from written words, it learns from video.
LLMs are normally trained on vast numbers of sentences or phrases in which some of the words are masked, forcing the model to find the best words to fill in the blanks. In doing so, they pick up a rudimentary sense of the world. Yann LeCun, who leads Meta’s FAIR (Fundamental AI Research) group, has proposed that if AI models could use the same masking technique, but on video footage, they could learn more quickly. ...
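To make the fill-in-the-blanks idea concrete, here is a minimal, illustrative Python sketch of the masking step. It is not Meta’s code or any real training pipeline; the `[MASK]` token, the `mask_tokens` helper, and the masking rate are hypothetical choices. A fraction of the words in a sentence is hidden, and the hidden words become the targets the model is trained to recover.

```python
import random

# Toy sketch of masked training data (illustrative only, not Meta's code).
# A fraction of tokens is hidden; the hidden tokens become the training targets.

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model is trained to fill this blank back in
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was warm".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', ...]
print(targets)  # e.g. {2: 'sat', ...} -- the blanks the model must fill in
```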
The embodiment of LeCun’s theory is a research model called Video Joint Embedding Predictive Architecture (V-JEPA). It learns by processing unlabeled video and figuring out what probably happened in a certain part of the screen during the few seconds it was blacked out. ...
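The same masking idea can be sketched for video. Meta describes V-JEPA as making its predictions in an abstract representation space rather than in raw pixels, but purely to illustrate the masking itself, the hypothetical sketch below blacks out a spatio-temporal block of a clip (a region of the screen across several frames) and keeps the hidden content as the prediction target. The array shapes and the `mask_video_block` helper are illustrative assumptions, not V-JEPA’s actual API.

```python
import numpy as np

# Illustrative sketch only (not V-JEPA itself): hide a spatio-temporal block of a
# video clip and keep the hidden content as the prediction target. V-JEPA predicts
# masked regions in a learned feature space rather than raw pixels, but the
# masking idea is the same.

def mask_video_block(video, t0, t1, y0, y1, x0, x1):
    """video: array of shape (frames, height, width, channels).
    Returns (masked_video, target): the block is zeroed out ("blacked out")
    and target holds the original content of that block."""
    target = video[t0:t1, y0:y1, x0:x1].copy()
    masked = video.copy()
    masked[t0:t1, y0:y1, x0:x1] = 0.0
    return masked, target

clip = np.random.rand(16, 64, 64, 3)               # a fake 16-frame RGB clip
masked_clip, target = mask_video_block(clip, 4, 12, 16, 48, 16, 48)
print(masked_clip.shape, target.shape)             # (16, 64, 64, 3) (8, 32, 32, 3)
```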
Meta’s next step after V-JEPA is to add audio to the video, which would give the model a whole new dimension of data to learn from, just like a child watching a muted TV and then turning the sound up. ...
See the full story here: https://www.fastcompany.com/91029951/meta-v-jepa-yann-lecun