AI’s understanding and reasoning skills can’t be assessed by current tests
PhilNote: this is a really nerdy article on the various approaches that researchers are taking to determine whether and when an AI "understands" what it is doing. It goes into the flaws of each technique. The conclusion is that an 'understanding test' is a complex moving target that we may never fully solve. For me the most interesting and disturbing finding from one of their evaluations was "Surprisingly, when the researchers investigated the models’ answers at each sub-step, they found that even when the final answers were right, the underlying calculations and reasoning — the answers at each sub-step — could be completely wrong."
... But “AI surpassing humans on a benchmark that is named after a general ability is not the same as AI surpassing humans on that general ability,” computer scientist Melanie Mitchell pointed out in a May edition of her Substack newsletter. ...
The Winograd Schema Challenge, or WSC, was proposed in 2011 as a test of a system’s intelligent behavior. Though many people are familiar with the Turing test as a way to evaluate intelligence, researchers had begun to propose modifications and alternatives that weren’t as subjective and didn’t require the AI to engage in deception to pass the test (SN: 6/15/12).
Instead of a free-form conversation, WSC features pairs of sentences that mention two entities and use a pronoun to refer to one of the entities. Here’s an example pair:
Sentence 1: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it removed.
Sentence 2: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it repaired.
A language model scores correctly if it can match the pronoun (“it”) to the right entity (“the roof” or “the tree”). The two sentences usually differ by a single special word (“removed” or “repaired”) that, when exchanged, changes the answer. Presumably, only a model that relies on commonsense world knowledge, rather than superficial linguistic cues, could answer both correctly.
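To make that scoring concrete, here is a minimal sketch (not from the article) of how a Winograd-style pair might be evaluated by candidate substitution. The `sentence_log_prob` function is a hypothetical stand-in for whatever likelihood the language model under test assigns to a full sentence; it is not a specific library API.

```python
# Minimal sketch of Winograd-style scoring by candidate substitution.
# sentence_log_prob is a hypothetical stand-in for any language model's
# sentence-likelihood function.

def resolve_pronoun(template, candidates, sentence_log_prob):
    """Return the candidate that makes the sentence most probable when it
    replaces the pronoun placeholder [IT]."""
    return max(candidates,
               key=lambda c: sentence_log_prob(template.replace("[IT]", c)))

pair = {
    "removed": ("In the storm, the tree fell down and crashed through the roof "
                "of my house. Now, I have to get [IT] removed."),
    "repaired": ("In the storm, the tree fell down and crashed through the roof "
                 "of my house. Now, I have to get [IT] repaired."),
}
candidates = ["the tree", "the roof"]
# Expected resolutions: "removed" -> "the tree", "repaired" -> "the roof".
```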
But it turns out that in WSC, there are statistical associations that offer clues. Consider the example above. Large language models, trained on huge amounts of text, would have encountered many more examples of a roof being repaired than a tree being repaired. A model might select the statistically more likely word among the two options rather than rely on any kind of commonsense reasoning. ...
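A toy illustration of that shortcut, with invented co-occurrence counts standing in for what a model might have absorbed from its training text:

```python
# Illustrative (invented) corpus co-occurrence counts of each entity with
# the special word. No commonsense reasoning is involved below.
cooccurrence = {
    ("roof", "repaired"): 9400, ("tree", "repaired"): 310,
    ("roof", "removed"): 1200,  ("tree", "removed"): 5800,
}

def shortcut_answer(special_word):
    """Pick whichever entity co-occurs more often with the special word."""
    return max(["roof", "tree"], key=lambda e: cooccurrence[(e, special_word)])

print(shortcut_answer("repaired"))  # roof  (right answer, purely statistical)
print(shortcut_answer("removed"))   # tree  (right answer, purely statistical)
```

Both answers come out right, which is exactly why a high WSC score alone cannot distinguish statistical association from commonsense reasoning.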
For some researchers, the fact that LLMs pass these benchmarks so easily simply means that more comprehensive benchmarks need to be developed. For instance, researchers might turn to a collection of varied benchmark tasks that tackle different facets of common sense, such as conceptual understanding or the ability to plan future scenarios. ...
But others are more skeptical that a model performing well on the benchmarks necessarily possesses the cognitive abilities in question. If a model tests well on a dataset, that tells us only that it performs well on that particular dataset and nothing more, Elazar says. ...
Taking a different approach to testing
Systematically digging into the mechanisms required for understanding may offer more insight than benchmark tests, Arakelyan says. That might mean testing AI’s underlying grasp of concepts using what are called counterfactual tasks. In these cases, the model is presented with a twist on a commonplace rule that it is unlikely to have encountered in training, say an alphabet with some of the letters mixed up, and asked to solve problems using the new rule. ...
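As a rough sketch of what such a counterfactual probe could look like, the snippet below generates a scrambled alphabet and the ground-truth answer for a simple sorting task under that new ordering. The prompt wording and task choice are illustrative, not taken from the article.

```python
# Sketch of a counterfactual-alphabet probe: the model must apply an
# unfamiliar rule (a scrambled letter order) rather than recall a familiar one.
import random

def make_counterfactual_alphabet(seed=0):
    letters = list("abcdefghijklmnopqrstuvwxyz")
    random.Random(seed).shuffle(letters)
    return "".join(letters)

def sort_with_alphabet(words, alphabet):
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return sorted(words, key=lambda w: [rank[c] for c in w])

alphabet = make_counterfactual_alphabet()
words = ["storm", "tree", "roof"]
prompt = (f"Use this alphabet order: {alphabet}. "
          f"Sort these words by that order: {', '.join(words)}.")
expected = sort_with_alphabet(words, alphabet)  # ground truth for grading the model's reply
print(prompt)
print("Expected:", expected)
```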
To try to get a better sense of language understanding, the team compared how a model answered the standard test with how it answered when given the same premise sentence but with slightly paraphrased hypothesis sentences. A model with true language understanding, the researchers say, would make the same decisions as long as the slight alteration preserves the original meaning and logical relationships. ...
But for a sizable number of sentences, the models tested changed their decision, sometimes even switching from “implies” to “contradicts.” When the researchers used sentences that did not appear in the training data, the LLMs changed as many as 58 percent of their decisions.
“This essentially means that models are very finicky when understanding meaning,” Arakelyan says. This type of framework, unlike benchmark datasets, can better reveal whether a model has true understanding or whether it is relying on cues such as word distributions. ...
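A bare-bones version of that consistency check, assuming a hypothetical `nli_label` function that returns the tested model's judgment ("entails", "contradicts" or "neutral") for a premise and hypothesis:

```python
# Sketch of a paraphrase-consistency check: count how often the model's
# entailment decision flips when the hypothesis is paraphrased without
# changing its meaning. nli_label is a hypothetical stand-in for the model.

def flip_rate(examples, nli_label):
    flips = sum(
        nli_label(premise, hypothesis) != nli_label(premise, paraphrase)
        for premise, hypothesis, paraphrase in examples
    )
    return flips / len(examples)

examples = [
    ("The tree crashed through the roof of the house.",
     "The roof was damaged.",          # original hypothesis
     "The roof sustained damage."),    # meaning-preserving paraphrase
]
# A model with consistent language understanding would score flip_rate == 0.
```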
Surprisingly, when the researchers investigated the models’ answers at each sub-step, they found that even when the final answers were right, the underlying calculations and reasoning — the answers at each sub-step — could be completely wrong. This confirms that the model sometimes relies on memorization, Dziri says. Though the answer might be right, it doesn’t say anything about the LLM’s ability to generalize to harder problems of the same nature — a key part of true understanding or reasoning. ...
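One way to picture that sub-step check, using multi-digit multiplication as a stand-in task (the numbers and grading function below are illustrative, not the researchers' code):

```python
# Grade both the final product and the intermediate partial products.
# A model can get the final answer right while its shown sub-steps are wrong.

def partial_products(a, b):
    """Ground-truth sub-steps: one shifted partial product per digit of b."""
    return [a * int(d) * 10 ** i for i, d in enumerate(reversed(str(b)))]

def grade(a, b, model_final, model_steps):
    return {
        "final_correct": model_final == a * b,
        "steps_correct": model_steps == partial_products(a, b),
    }

# Hypothetical model output: correct product, incorrect intermediate steps.
print(grade(47, 86, model_final=4042, model_steps=[282, 3800]))
# -> {'final_correct': True, 'steps_correct': False}
```

Running the same grading on larger operands is what exposes whether the correct final answers generalize or were effectively memorized.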
In truth, a perfect AI evaluation might never exist. The more language models improve, the harder tests will have to get to provide any meaningful assessment. ...
See the full article here: https://www.sciencenews.org/article/ai-understanding-reasoning-skill-assess