New paper (on an old AI) tests o1 against doctors on medical benchmarks & real ER cases: “across a variety of scenarios and applications, the large language model outperformed both human physicians and older models” The potential suggests an “urgent need for prospective trials.”
