Talk

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie
2024/4/26
Agent
Sociological Simulation

Abstract

OSWorld is a novel, scalable environment designed to evaluate autonomous digital agents across diverse real-world computer tasks. Supporting multiple operating systems such as Ubuntu, Windows, and macOS, OSWorld enables comprehensive, execution-based evaluations of agents in interactive settings involving web and desktop applications. Our benchmark includes 369 tasks derived from actual use cases, highlighting the current limitations of state-of-the-art agents, which achieve only a 12.24% success rate compared to humans’ 72.36%. This platform provides crucial insights for advancing multimodal agent development. Resources are publicly available to encourage further exploration in this promising field.

Speaker

Tianbao Xie is currently a second-year PhD student at the University of Hong Kong, where he is advised by Tao Yu (primary), Lingpeng Kong, and Ben Kao. His primary research interests lie in Artificial Intelligence and Natural Language Processing, with a particular focus on developing large-scale neural-symbolic AI systems and autonomous agents.

Video

Extra Details

Speaker Website / Paper Link / Paper Code / Paper Project Page