philip lelyveld The world of entertainment technology

2 Apr 2026

AI benchmarks are broken. Here’s what we need instead.

...

It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams—radiologists, oncologists, physicists, nurses—who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care. ...

When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the “AI graveyard.” ...

HAIC benchmarks reframe current benchmarking in four ways: 

1. From individual and single-task performance to team and workflow performance (shifting the unit of analysis)

2. From one-off testing with right/wrong answers to long-term impacts (expanding the time horizon)

3. From correctness and speed to organizational outcomes, coordination quality, and error detectability (expanding outcome measures)

4. From isolated outputs to upstream and downstream consequences (system effects)

...

See the full story here: https://www.technologyreview.com/2026/03/31/1134833/ai-benchmarks-are-broken-heres-what-we-need-instead/
