arxiv:2606.07591

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Published on May 28

Abstract

ResearchClawBench evaluates autonomous scientific research capabilities across 40 tasks from 10 domains using expert-curated criteria and reveals current limitations in re-discovery accuracy among AI agents and LLMs.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
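To make the idea of weighted rubric criteria concrete, here is a minimal sketch of how one task's score could be aggregated from its criteria. The abstract does not give the actual scoring formula, so everything here is an assumption for illustration: the `Criterion` type, the 0-100 per-criterion scale, the weight-normalized average, and the unweighted mean across tasks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    description: str
    weight: float  # relative importance assigned by the expert
    score: float   # grader's score for this item, assumed to lie in [0, 100]


def rubric_score(criteria: Sequence[Criterion]) -> float:
    """Weight-normalized average of criterion scores (assumed aggregation)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


def benchmark_average(task_scores: Sequence[float]) -> float:
    """Unweighted mean over tasks (assumed; the source only reports averages)."""
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores) / len(task_scores)


# Illustrative rubric with made-up values, not taken from the benchmark.
example = [
    Criterion("reproduces the main quantitative result", weight=2.0, score=50.0),
    Criterion("uses the same experimental protocol", weight=1.0, score=0.0),
    Criterion("key figure matches the target", weight=1.0, score=100.0),
]
task_score = rubric_score(example)
```

Under this aggregation, a heavily weighted criterion that is only half met pulls the task score down more than a lightly weighted one that is missed entirely, which is one way a rubric can emphasize the scientific core of a paper.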

Community

black-yt

Paper submitter 19 minutes ago

We recently added more evaluation results for agents and LLMs to our homepage. You can click through to read the papers written by different agents, along with their scores. It's very interesting.