How Google “Translates” Pictures Into Words Using Vector Space Mathematics
Now Oriol Vinyals and pals at Google are using a similar approach to translate images into words. Their technique is to use a neural network to study a dataset of 100,000 images and their captions and so learn how to classify the content of images.
But instead of producing a set of words that describe the image, their algorithm produces a vector that represents the relationship between the words. This vector can then be plugged into Google’s existing translation algorithm to produce a caption in English, or indeed in any other language. In effect, Google’s machine learning approach has learnt to “translate” images into words.
To test the efficacy of this approach, they used human evaluators recruited from Amazon’s Mechanical Turk to rate captions generated automatically in this way along with those generated by other automated approaches and by humans.
The results show that the new system, which Google calls Neural Image Caption, fares well. Using a well known dataset of images called PASCAL, Neural image Capture clearly outperformed other automated approaches. “NIC yielded a BLEU score of 59, to be compared to the current state-of-the-art of 25, while human performance reaches 69,” says Vinyals and co.
That’s not bad and the approach looks set to get better as the size of the training datasets increases.
Pages
- About Philip Lelyveld
- Mark and Addie Lelyveld Biographies
- Presentations and articles
- Trustworthy AI – A Market-Driven approach
- Tufts Alumni Bio