Databricks’ OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs

Sean Michael Kerner · December 9, 2025

Credit: Image generated by VentureBeat with FLUX-2-Pro

There is no shortage of AI benchmarks on the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others. AI agents excel at solving the abstract math problems and passing the PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

"If we focus our research…

Read more on VentureBeat