Top model scores may be skewed by Git history leaks in SWE-bench

SWE-bench/SWE-bench, issue #465 (open): Repo State Loopholes During Agentic Evaluation. Opened by jacobkahn on Sep 3, 2025.

We’ve identified multiple loopholes with SWE Bench Verified where agents may look at future repository state (by querying it directly or through a variety of methods), and cases in which future repository state includes either solutions or detailed approaches to solving problems (commit messages and more). Examples: A trajectory with Claude 4 Sonnet, Pytest-dev__pytest-6202 (complete output here), the agent uses git…
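To make "future repository state" concrete, here is a minimal sketch of how such a leak could be detected. It is not taken from the issue or from the SWE-bench harness. It assumes a toy commit graph given as plain dictionaries, and the names `reachable`, `future_commits`, `parents`, and `refs` are hypothetical. The idea is that any commit reachable from a branch or tag but not from the checked-out task commit postdates the task and may contain the fix. On a real checkout, `git rev-list --all --not HEAD` lists roughly the same set.

```python
from typing import Dict, List, Set


def reachable(start: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every commit reachable from `start` by following parent links."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def future_commits(
    head: str, refs: Dict[str, str], parents: Dict[str, List[str]]
) -> Set[str]:
    """Commits visible through some ref but not part of HEAD's own history."""
    past = reachable(head, parents)
    leaked: Set[str] = set()
    for tip in refs.values():
        leaked |= reachable(tip, parents) - past
    return leaked


# Toy history (hypothetical): a <- b <- c on main, plus a tag on d, a child of b.
# The task environment is checked out at b, so c and d lie in its "future".
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"]}
refs = {"refs/heads/main": "c", "refs/tags/v2": "d"}

print(sorted(future_commits("b", refs, parents)))  # ['c', 'd']
```

An empty result means no ref exposes history beyond the task commit. This sketch only covers refs; other places where later commits can linger, such as reflogs or unreferenced objects, would need separate checks.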

Read more on Hacker News